


Abstract 



Four writing samples were obtained from 638 applicants for admission to 
U.S. institutions as undergraduates or as graduate students in business, 
engineering, or social science. The applicants represented three major 
foreign language groups (Arabic, Chinese, and Spanish), plus a small sample of 
native English speakers. Two of the writing topics were of the compare and 
contrast type and the other two involved chart and graph interpretation. The 
writing samples were scored by 23 readers who are English as a second language 
specialists and 23 readers who are English writing experts. Each of the four 
writing samples was scored holistically, and during a separate rating session 
two of the samples from each student were assigned separate scores for 
sentence-level and discourse-level skills. Representative subsamples of the 
papers also were scored descriptively with the Writer's Workbench computer 
program and by graduate-level subject matter professors in engineering and the 
social sciences. 



In addition to the writing sample scores, TOEFL scores were obtained for 
all students in the foreign sample. GRE General Test scores were obtained for 
students in the U.S. sample and for a subsample of students in the foreign 
sample. Students in the U.S. sample also took a multiple-choice measure of 
writing ability. 

Among the key findings were the following: (1) holistic scores, 
discourse-level scores, and sentence-level scores were so closely related 
that the holistic score alone should be sufficient; (2) correlations among 
topics were as high across topic types as within topic types; (3) scores of 
ESL raters, English raters, and subject matter raters were all highly 
correlated, suggesting substantial agreement in the standards used; (4) 
correlations and factor analyses indicated that scores on the writing 
samples and TOEFL were highly related, but that each also was reliably 
measuring some aspect of English language proficiency that was not assessed 
by the other; and (5) correlations of holistic writing sample scores with 
scores on item types within the sections of the GRE General Test yielded a 
pattern of relationships that was consistent with the relationships 
reported in other GRE studies. 
I. INTRODUCTION 



The ability to write clearly is an essential skill needed by 
undergraduate and graduate students. With the recognition that too many 
students pass through our educational system with only minimal English 
language competence, educators are reappraising their methods and 
redefining their objectives. Writing competence, in particular, is being 
addressed as a skill that is integral to effective communication. 
Therefore, researchers and educators recently have directed considerable 
effort toward the measurement of writing ability, and, in turn, to the 
understanding of its relationship to other cognitive skills. In the past, 
measurement of writing skills has been achieved largely by means of 
indirect measures: test items cast in the multiple-choice format. However, 
the definition of writing competence currently is being expanded and 
refined. Although multiple-choice measures provide some indicators of 
written language skills, they do so indirectly, in that students respond to 
writing tests by recognizing a correct answer among a finite set of 
alternatives. Because the act of writing involves the production of a 
written piece, actual writing samples, or direct measures of writing, now 
are viewed as a more appropriate means for assessing writing performance 
because they more nearly approximate real discourse. 

Two major testing programs at Educational Testing Service (ETS), the 
Test of English as a Foreign Language (TOEFL) and the Graduate Record 
Examinations (GRE), provide scores on multiple-choice measures that 
contribute to decisions made during the postsecondary admission process. 
The purpose of this study was to determine the relationship of scores on 
direct and indirect measures of writing ability to scores on the TOEFL and 
the GRE General Test. This project is a response by ETS to the assessment 
concerns expressed by educators in the field: the examination of direct 
methods for evaluating writing skills and the relationship of these 
measures to other, more conventional measures of developed abilities. 

The TOEFL was designed to assist an institution in determining whether 
a foreign applicant for whom English is a second language has attained 
sufficient proficiency in English to study at that institution, at either 
the undergraduate or graduate level. An important component of that 
general proficiency is the ability to communicate in written English. In 
the TOEFL examination, the Structure and Written Expression section 
(Section 2) is an indirect measure of writing ability. The GRE General 
Test, as one indicator of potential for graduate study, serves as an 
instrument for admission to graduate-level education for applicants who are 
either native or nonnative speakers of English. The GRE General Test 
provides scores that are intended to assess developed abilities in the 
verbal, quantitative, and analytical reasoning domains, but does not 
contain an indirect measure of writing ability as such. Thus the TOEFL and 
GRE General Test provide complementary functions with respect to the 
admission of foreign students to graduate programs. 

A recent informal survey of professionals in the field of English as a 
Second Language (ESL) conducted by Hale and Hinofotis (1981) identified the 
measurement of productive skills (e.g., speaking and writing) as highly 
desirable for preadmission testing and placement decisions. A report by 
elicit performance of the skills the samples are intended to measure. 
Moreover, since Pike's study involved the relationships between performance 
on writing samples and the old form of TOEFL, these relationships should be 
reexamined for the new form. 

The current project investigated the relationship of scores from a 
current TOEFL form and from the GRE General Test with scores on writing 
samples reflecting the kind of performance that would be required of 
beginning undergraduate and graduate students in the three fields enrolling 
the largest numbers of foreign students: business, engineering, and the 
social sciences. The study builds on the information obtained in a 
previous TOEFL project, a survey of academic writing tasks (Bridgeman & 
Carlson, 1983) that investigated the kinds of writing skills required of 
students across different departments in United States and Canadian 
institutions of higher education. The results of the 1983 study are 
summarized in a subsequent section of this report that brings together 
other research findings in the realm of communicative competencies. An 
additional objective of the TOEFL writing survey was to design a study that 
would relate TOEFL scores to scores on student writing samples, using 
appropriate topics identified in the survey. The study focuses on 
nonnative students who take the TOEFL as part of the admission process for 
entrance into United States and Canadian institutions. A logical extension 
of this work includes GRE General Test scores for nonnative as well as 
native speakers of English. Contrasting the correlational patterns for 
nonnative speakers in various academic disciplines with those for native 
speakers within and across disciplines provides information that is useful 
for admission and placement of both native and nonnative speakers and for 
more meaningful interpretation and use of GRE General Test and TOEFL 
scores. 



Foundations for the Design and Implementation of the Study — 
Research, Theory, and Practice 

The primary purpose of this study was to determine the relationships of 
TOEFL and GRE General Test scores to the kinds of writing tasks that 
first-year students are expected to perform. These data provide important 
information regarding the construct validity of the GRE General Test and 
the TOEFL, information that should be useful to those who interpret the 
scores on these tests as well as to ETS test developers who may be 
considering the addition of direct measures of writing ability, in the case 
of the TOEFL, and of indirect and direct measures of writing ability for 
future GRE General Test forms. The study involved the collection of four 



According to the 1980-81 survey conducted by the Institute of 
International Education (Boyan, 1981), approximately 26 percent of the 
foreign students in the United States are enrolled in engineering 
programs, 17 percent in business and management, and 8 percent in the 
social sciences. 
writing samples from native and nonnative speakers of English who were 
seeking admission to undergraduate and graduate levels of education in the 
United States and Canada. In addition, recent GRE General Test and TOEFL 
scores were obtained for the appropriate groups of candidates (e.g., 
candidates for admission to undergraduate programs do not take the GRE). 
The TOEFL scores include an indirect measure of writing skill, the 
Structure and Written Expression section of the TOEFL; scores on a 
comparable indirect measure (a section of a retired form of the LSAT) were 
obtained for native-speaking GRE candidates. The standardized test scores 
then were related to holistic and analytic scores on the writing samples. 
The plans for the data collection procedures are described in the final 
section of this chapter, and the specific procedures that were implemented 
appear in subsequent chapters. Before the implementation of the study is 
presented in detail, however, the rationales for the design of this complex 
project are explained in this section. 
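
As a minimal sketch of what relating two sets of scores can mean in 
practice, the following computes a Pearson product-moment correlation 
between a standardized test score and a holistic writing score. The scores 
and the function name `pearson_r` are hypothetical illustrations, not data 
or procedures taken from the study. 

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for six examinees (invented for illustration only).
test_scores = [480, 520, 550, 570, 600, 630]
holistic_scores = [2, 3, 3, 4, 5, 5]

r = pearson_r(test_scores, holistic_scores)
print(f"r = {r:.2f}")
```

A coefficient near 1 would indicate that examinees are ranked similarly by 
the two measures; it would not by itself show that the measures assess the 
same skill. 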

The most significant and fundamental tasks for this research required 
(1) the design of writing assessment instruments and (2) the collection and 
scoring of writing samples with these instruments. Elaborate planning was 
necessary, since the validity and usefulness of the information gained by 
the data analyses would depend on the quality of the measurement process. 
To achieve the best and most appropriate assessment of writing skills, the 
study design took into account the numerous perspectives that the state of 
the art in the evaluation of writing ability has to offer. We combined the 
knowledge and experience accumulated by a variety of disciplines: writing 
assessment and instruction, psychological measurement, linguistics, 
contrastive rhetoric, and instruction in English as a second language 
(ESL). Each of these fields offers insights garnered from theory, 
research, and practice. Our first planning objective focused on the 
definition of competence in writing, a definition that emphasizes the 
situational context of writing assessment appropriate to the objectives of 
the TOEFL and the GRE General Test as indicators of a student's ability to 
write English. This definition was formulated on the basis of information 
drawn from the areas of writing assessment, communicative competency, and 
contrastive rhetoric. Our second planning objective required the design of 
a validation study that depended on the development of effective 
instruments to evaluate written competence and on rigorously implemented 
data collection and scoring procedures. The following section briefly 
summarizes the framework for formulating a functional definition of writing 
ability, including our survey of academic writing skills. The 
subsequent section describes the bases for the design of the validation 
study. 

A Definition of Writing Competence 

The term "measure" suggests the ability to assign a value, or number, to 
what is being evaluated. In any form of writing assessment, that measure is 
subject to error, since it is based on inferential judgments with respect to 
standards that define competent writing. The definition of what we are 
seeking to measure is achieved by circumscribing the characteristics of 
writing ability, given the limitations of the state of the art in the 
measurement of written responses produced by individuals. In writing 
assessment, experts in the field still are attempting to develop an objective 
definition of competent writing. It is important to recognize that competent 
writing is a construct, or concept, that requires careful definition in order 
to be measured. In addition, the definition of this construct may vary from 
instance to instance, in that competent writing is situational: it is defined 
by the specific task demands within the particular situation in which, and for 
which, writing ability is being assessed. When writing ability is evaluated, 
that ability itself is not measured directly, but rather, assessed on the 
basis of inferences drawn from an individual's performance. 

As we sought to develop a working definition of writing competence in the 
context of the GRE General Test and TOEFL, we drew on six perspectives, as 
described in the following sections: the new paradigm for writing assessment, 
functionally based communicative competencies, field-specific writing task 
demands, the TOEFL survey of academic writing tasks, a theoretical perspective 
of functional communicative competency, and perspectives from contrastive 
rhetoric. 

The New Paradigm for Writing Instruction and Assessment 

One leader in the field of writing assessment, Odell (1981), recently 
redefined writing competence ". . . to mean the ability to discover what one 
wishes to say and to convey one's message through language, syntax, and 
content that are appropriate for one's audience and purpose" (p. 103). In the 
direct assessment of writing, the writer is presented with some form of 
written communication that designates the task(s) to be accomplished. This 
communication varies in the degree to which the specific demands of a 
particular task are described. Depending on the amount and kinds of 
information provided, the verbal statements to the writer also communicate 
expectations about performance; in turn, these statements reflect, in varying 
degrees, the standards or criteria that will be applied in the evaluation of 
the written product. 

Because the characteristics that contribute to competent writing are 
situationally dependent, the elements of the writing task presented should be 
predicated on a definition of writing competence that is directly parallel to 
the specific objectives for evaluating writing within a specific situational 
context. These objectives and the context of the evaluation must be 
described and, subsequently, reflected by the design of the writing assessment 
measure. Since the present research was conducted under the auspices of 
testing programs that serve as preliminary indicators of a candidate's 
readiness to participate successfully in an English-based curriculum at the 
undergraduate and graduate levels of education, we sought to define writing 
competence from the standpoint of the objectives of these tests: the 
standpoint of functional communicative competency. 

Functionally Based Communicative Competencies 

Linguists who have investigated the dimensions of language teaching and 
testing (Canale, 1983; Canale & Swain, 1979; Munby, 1978; Walz, 1982) 
emphasize the approach of "functionally based communicative competency." 
Briefly defined, it entails the ability to use language to communicate 
effectively within the specific context in which the communication takes 
place; it is "functional," in that it "works," serving to convey what the 
person intended and resulting in appropriate receptive behavior (thought or 
action) by the recipient of the communication. This functional orientation 
provides an explanation for the observed discrepancies between knowledge of 
grammar and conventions and actual production on direct measures of writing 
skills. 



Field-Specific Writing Task Demands 

Other researchers have focused their investigations of functionally based 
academic writing task demands on field-specific requirements, with emphasis 
on English for specific purposes. For example, West and Byrd (1982) surveyed 
25 engineering faculty members at the University of Florida to identify the 
kinds of writing assigned to graduate students during one academic year 
(1979-80). West (1982) also surveyed 33 engineering faculty members during 
the same year, asking them to rate American and foreign students on eight 
writing dimensions. These faculty ranked the performance of all foreign 
graduate students lower than the performance of American students on all the 
writing dimensions, except for quality of content. Making pairwise 
comparisons on the eight dimensions of foreign student writing, West ordered 
the dimensions from weakest to strongest as follows: (1) correctness of 
punctuation, (2) quality of sentence structure, (3) vocabulary size, (4) 
correctness of vocabulary usage, (5) quality of paragraph organization, (6) 
quality of overall paper organization, (7) quality of content, and (8) overall 
writing ability. We adapted these dimensions for use in our TOEFL survey of 
academic writing tasks (Bridgeman & Carlson, 1983), described in the next 
section. 



In a study that typifies research in writing for academic purposes, 
Johns (1980) focused on the cohesive elements in written business discourse. 
Hill, Soppelsa, and West (1982), stressing the academic need for ESL students 
to learn to write experimental research papers, outlined an instructional 
approach that similarly aims at functional discourse. Pointing to the growing 
interest in English for specific purposes and in English for academic 
purposes, these researchers identified experimental research papers as 
important to academic and professional success in the sciences and social 
sciences. Another ESL instructional approach recently described by Spack and 
Sadow (1983) emphasizes the composing process and writing assignments that 
students will face in academic and professional situations. 



Recently a number of researchers have attempted to identify some of the 
writing tasks that are required of graduate and undergraduate students 
within functional contexts: Freedman (1979), Johns (1981), Kroll (1979), 
Ostler (1981), Weaver (1982). Their findings, which identified writing 
task demands within the contexts of their specific institutions, are 
described in a report of our previous research (Bridgeman & Carlson, 
1983). 
The TOEFL Survey of Academic Writing Tasks 

The literature on functional communicative competency served as the basis 
for the design of a research project that would provide a definition of 
writing task demands in postsecondary academic settings. The primary 
objective of this project (Bridgeman & Carlson, 1983) was to identify and 
describe operationally the expectations of writing competence required of 
nonnative speakers of English at the beginning of their educational 
experiences in institutions of higher education in the United States and 
Canada. The information we gathered took into account the various factors 
that should be considered in defining communicative competence in writing: the 
functional task demands for which students are expected to be prepared, as 
well as the perceptions, sometimes culturally influenced, of those who 
evaluate them. Initially, informal interviews and the literature provided the 
basis for the design of a survey instrument that incorporated the full range 
of expectations of writing competence. The writing task demands, features of 
writing tasks (adapting West's dimensions of student writing), and types of 
writing sample topics were expressed in terminology that would communicate 
clearly to individuals in various disciplines. Subsequently, a representative 
sample of departments within institutions responded to the questionnaire, 
providing a basis for describing the domain of writing competencies expected 
of entering native and nonnative students. 

The survey questionnaire was completed by faculty in 190 academic 
departments at 34 universities in the United States and Canada with high 
foreign student enrollments. At the graduate level, six academic disciplines 
with relatively high numbers of nonnative students were surveyed: business 
management (MBA), civil engineering, electrical engineering, psychology, 
chemistry, and computer science. Undergraduate English departments were 
chosen to document the skills needed by undergraduate students. 

The major findings are summarized as follows: 

o Although writing skill was rated as important to success in 
graduate training, it was consistently rated as even more 
important to success after graduation. 

o Even disciplines with relatively light writing requirements 
(e.g., electrical engineering) reported that some writing is 
required of first-year students. Lab reports and brief article 
summaries are common writing assignments in engineering and the 
sciences. Longer research papers are commonly assigned to 
undergraduates and to graduate students in MBA, civil 
engineering, and psychology programs. 

o Descriptive skills (e.g., describe apparatus, describe a 
procedure) are considered important in engineering, computer 
science, and psychology. In contrast, skill in arguing for a 
particular position is seen as very important for undergraduates, 
MBA students, and psychology majors, but of very limited 
importance in engineering, computer science, and chemistry. 

o Faculty members reported that, in their evaluations of student 
writing, they rely more on discourse-level characteristics (e.g., 
organization of ideas, quality of content) than on word- or 
sentence-level characteristics (e.g., punctuation/spelling, 
sentence structure, vocabulary size). 

o Discourse-level writing skills of natives and nonnatives are 
perceived as fairly similar, but significant differences between 
natives and nonnatives were reported for sentence- and word-level 
skills and for overall writing. A majority of departments 
reportedly use the same standards for evaluating the writing of 
native and nonnative students, although nearly a third of the 
departments reportedly use different standards. 

o Respondents were asked to rate types of writing sample topics, 
to indicate their preference for topics that would most likely 
elicit evidence of the writing skills that would facilitate 
performance in academic contexts. (Two examples of each type 
were provided.) The 10 topic types represented a range of 
writing assignments: (A) personal essay, (B) sequential or 
chronological description, (C) spatial or functional description, 
(D) compare and contrast, (E) compare and contrast plus take a 
position, (F) extrapolation, (G) argumentation with audience 
designation, (H) describe and interpret a graph or chart, (I) 
summarize a passage, and (J) summarize a passage and 
analyze/assess the point of view. The clear favorite among the 
engineering and science departments was Topic H (describe and 
interpret a graph or chart). However, this topic was perceived as 
inappropriate by a majority of the undergraduate English faculty. 
Topic G (argumentation with audience designation) was the 
favorite among MBA programs; Topic E (compare and contrast plus 
take a position) also was evaluated positively by the MBA 
programs and was the favorite among undergraduate English 
faculty. 

o To obtain a summary picture of the relationships among topic types 
both within and between academic disciplines, the acceptability 
ratings were analyzed using a multidimensional scaling approach that 
accommodates differences between raters. Within each discipline, 
the pattern of responses to each topic type was compared to the 
pattern of responses for every other topic type. The positions of 
the topic types, as rated by the respondents, reflect the 
perceptions of the similarities and differences among the topic 
types. The multidimensional scaling suggested that the respondents 
reacted to the topic types as having two dimensions, one determined 
by the complexity of the task demanded by the topic type, and the 
other, by the degree of personal involvement required. Topic H can 
then be seen as a relatively simple and impersonal task. Topic E is 
a little above average on the complexity dimension and is a task 
requiring a relatively high degree of personal involvement in the 
topic. 
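
The comparison of response patterns between topic types can be sketched as 
a pairwise dissimilarity matrix, which is the usual input to a 
multidimensional scaling procedure. This is a hedged illustration only: the 
ratings are invented, the use of Euclidean distance is an assumption (the 
dissimilarity measure and scaling program used in the survey are not 
specified here), and the names `profile_distance` and 
`dissimilarity_matrix` are hypothetical. 

```python
import math


def profile_distance(a, b):
    """Euclidean distance between two rating profiles of equal length."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of raters")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dissimilarity_matrix(ratings):
    """Map each ordered pair of topic types to the distance between profiles."""
    topics = sorted(ratings)
    return {
        (s, t): profile_distance(ratings[s], ratings[t])
        for s in topics
        for t in topics
    }


# Invented acceptability ratings (1 = low, 5 = high) from four hypothetical
# raters; each list is one topic type's profile across those raters.
ratings = {
    "E": [4, 5, 2, 3],
    "G": [4, 4, 2, 3],
    "H": [2, 1, 5, 5],
}

matrix = dissimilarity_matrix(ratings)
for (s, t), d in sorted(matrix.items()):
    if s < t:  # print each unordered pair once
        print(f"{s}-{t}: {d:.2f}")
```

Topic types that raters treat alike end up with small distances, so a 
scaling procedure would place them close together in the resulting space. 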

In sum, the faculty members surveyed appeared to view student writing 
skills from the standpoint of functional communicative competencies. For 
example, the written products prepared by students in different disciplines 
may be considered competent to the extent that they meet the task demands 
(particularly kinds of writing assignments and certain skills) that are 
specific to a discipline. In addition, faculty members reported that written 
assignments were evaluated on the basis of discourse-level characteristics, 
rather than word- or sentence-level characteristics, and that they perceived 
the discourse-level writing skills of natives and nonnatives to be fairly 
similar. Grammatical competency, however, tends to influence evaluations of 
student writing to some extent, since respondents reported that nonnatives are 
more deficient in word- and sentence-level skills than are natives. 

A Theoretical Perspective of Functional Communicative Competency 

Our effort to define academic writing tasks required of entry-level 
students in postsecondary institutions also was based on the theoretical 
insights of Canale and Swain (1979; Canale, 1983). Canale proposes a 
framework that distinguishes three types of language proficiency: basic, 
communicative, and autonomous. He believes that the most fundamental 
problem in language assessment results from the lack of an adequate 
theoretical framework for language proficiency. He summarizes the recent 
work by Bruner and Cummins regarding language proficiency and poses a 
framework that builds on their work, with modifications. Cummins (1983) 
provides a revision and clarification that is more directly applicable to 
language proficiency, in that language tasks are classified into four 
primary groups: cognitively demanding/cognitively undemanding and 
context-embedded/context-reduced. The context continuum for the 
classification of tasks ranges from context-embedded, which involves a 
"shared reality" or common world knowledge, to context-reduced tasks that 
". . . require greater reliance on linguistic cues to meaning and on the 
propositional and logical structure of the information involved rather than 
on shared (or even existing) reality" (p. 337). The cognitive continuum 
ranges from tasks demanding little active cognitive involvement to tasks 
demanding much active, complex cognitive processing. This representation of 
language tasks clearly resembles the two-dimensional representation of 
writing topic types as perceived by academic respondents to our TOEFL survey 
of academic writing skills, in which the ordering of the ratings of topic 
types suggests two dimensions: cognitive complexity and personal involvement. 

The perspective of functional communicative competency, in combination 
with other theoretical insights and research findings reported in this 
chapter, contributed several propositions that were the basis for design of 
this writing assessment research. The propositions are the following: 

o Performance, which can be assessed in various ways, serves as an 
indirect means for evaluating language proficiency. The kind and 
degree of language proficiency being measured by a specific task 
are determined by the nature of that task. 

o To evaluate performance on a task, the dimensions of that task, 
which condition the performance elicited, must be specified 
clearly. 

o The following elements of a task that will be used to infer kinds 
and degrees of language proficiency must be accurately described, 
to the extent possible, both to the individual whose performance 
is being assessed and to the individual(s) who will evaluate that 
performance: 

The nature of the task demands, in terms of cognitive 
complexity and degree of personal involvement required. 

The nature of the linguistic performance that is expected to 
be elicited by the specific task, with the reservation that 
the linguistic performance that will be observed is what the 
examinee has produced within a specific context. That 
linguistic performance cannot necessarily be generalized to 
an evaluation of overall linguistic performance in (in this 
case) the written mode of communication. 

The nature of the hypothesized communicative situation in 
which the task places the examinee; e.g., the stated or 
implied purpose and audience to be addressed. 

The nature of the testing situation and all aspects of that 
social context that might influence differentially 
performance on the task; e.g., time limitations that do not 
allow for full organization and revision, the score on the 
task as one determinant of admission to an institution. 

The methods and procedures used for assigning scores to 
performance, which provide reasonable restrictions on score 
interpretation. The scoring method, for example, should 
reflect the scorers' appreciation of the dimensions of the 
task, the task demands, and the specific performance features 
that can be validly evaluated. 

Canale (1983) proposes that the general framework originally posed by 
Canale and Swain (1979) with reference to communicative language also would 
be useful to other approaches to language. This distinction between 
communicative competence and performance is essential to the evaluation of 
language proficiency. 

Perspectives from Contrastive Rhetoric 

Another area of research that has explored the academic task demands 
required of nonnative speakers of English has been termed "contrastive 
rhetoric." In this area, rhetorical patterns across cultures are 
identified and compared (Kaplan, 1972, 1976, 1977, 1982). The results of 
studies of contrastive rhetoric provide somewhat mixed evidence, some 
rejecting and others supporting the underlying assumption that the 
structural differences between the native language and the foreign language 
may interfere with the learning of the foreign language. We reviewed 
several representative papers in this area in order to take cultural 
differences into account. 

The work of Buckingham (1979), Lindstrom (1981), Pearson (1981), 
Takala, Purves, and Buckmaster (1982), and Purves (1984) particularly 
informed the process of topic selection and training of readers. The 
perspectives of cultural relativity provide a framework that influenced our 
decisions regarding the design of writing assessment tasks, the scoring of 
the collected writing samples, and the interpretation of results. Cultural 
differences in response to the demands of a writing task were taken into 
account as we attempted to identify and control the various parameters 
influencing the assessment of the writing performance of students from 
different international cultures. These parameters are described in the 
following section. 



The design of a writing assessment program is influenced by practical
considerations such as costs and staffing; whatever the limitations imposed
for the sake of efficiency, the interpretation of the results of any
writing assessment must be conditioned by the factors that may have
contributed to the results. Some of these parameters of a writing
assessment program can be controlled, or accounted for, by good advance
planning; others that cannot be controlled should at least be recognized as
exerting possible effects on the outcomes of the assessment. The design of
an investigation based on samples of writing ability requires the
implementation of carefully planned procedures. As proposed, this validation
study was executed in a series of stages: instrument development,
administration of experimental tests, scoring of direct assessment
instruments, scoring of other instruments, and analyses of data. These
stages, summarized in subsequent chapters of this report, are briefly
described here.



As we became involved in the development of measures for the direct
assessment of writing performance, we considered the additional
perspectives afforded by practice, research, and current theory regarding
design of writing prompts. A considerable amount of literature is devoted
to the design of writing test prompts, as summarized by Ruth (1982). At
ETS, another source of knowledge for this study was the experience of
practitioners who have conducted large-scale writing assessment programs.
This expertise represents the state of the art in the design of writing
assessment tasks; however, much research remains to be conducted regarding
the extent to which the parameters of a writing assessment instrument
influence writing performance.



Design of the Writing Assessment Validation Study 



Instrument Development

The writing stimulus, or the verbal statement that elicits the specific
writing performance being targeted, requires careful development and
pretesting. Pretesting of the writing stimulus is essential: topics may
appear superficially to achieve the desired results, but actual writing
samples obtained from a representative population of students may yield
surprising information about how the topic is perceived and the nature of
the responses that are produced. Pretesting in this instance influenced
our judgments about how well the following objectives were being met:

o The mode of discourse or type of writing assignment that the
  task presents (e.g., personal essay, persuasive argument) should
  reflect the expectations of writing required in undergraduate and
  graduate academic work in the United States.

o The writing tasks should avoid content with cultural bias,
  culture-bound vocabulary and concepts that might penalize a
  nonnative speaker, as well as topics that evoke heavily
  emotion-laden responses.

o The statement of the task should clearly communicate the
  expectations of writing performance demanded by the writing
  stimuli.

o The expectations for writing performance should be reasonable,
  given time constraints.

o Students would be asked to write on all four topics to elicit
  equivalent, comparable performances and to avoid eliciting
  differential performance within modes in response to different
  topics, which would invalidate the assessment.

Our survey of academic writing tasks provided the basis for the 
development of writing assessment instruments. The survey enabled us to 
define writing competence functionally in terms of the writing tasks that 
beginning postsecondary students would be expected to perform and the 
measurement objectives of the TOEFL and GRE General Test. In addition, the 
survey guided us in the selection and implementation of the parameters 
influencing the measurement of writing skills, such as specific approaches 
to scoring, that were critical to this writing assessment data collection. 

The survey indicated that no single essay topic type was universally
accepted by all the academic disciplines surveyed. In the multidimensional
scaling, Types H and E were further apart in the space than any other pair
of types, suggesting that they were perceived as distinctly different
tasks. Thus the Type E topic type was selected to serve as an effective
contrast to the Type H topic type; since departments perceived these two
types as distinctly different, it seemed likely that writing samples
elicited by Types H and E would reflect different writing skills as well.

2 A full discussion of these task factors will appear in the chapter,
"Testing ESL Student Writers" (Carlson and Bridgeman, in press).

For this project, we proposed to develop two topics of Type H and two
topics of Type E to which each student would respond. To administer topics
that would most effectively meet the measurement objectives of the study,
several topics of each type were developed and pretested. The pretesting
allowed us to judge the topics in relation to the criteria discussed in
this chapter.

Administration Factors 

Major factors in test administration that contribute to the outcomes of 
writing assessment were taken into consideration: 

o The physical layout of the writing stimulus was designed to give
  writers the opportunity for prewriting tasks of planning and
  organization and to suggest the expected length of the writing
  sample.

o Directions to administrators were designed to minimize such
  adverse conditions in the testing room as uncomfortable
  temperature, poor lighting, noise, and poor writing surface.

These factors are critical to any testing situation but assume greater 
importance when students are asked to generate and produce written 
responses. 

Data Collection 

Most of the data collection procedures that we had proposed to carry
out were accomplished, with the exception of a few practical modifications.

Sample 

We obtained a total sample of candidates for undergraduate and graduate
study representing three language groups (Spanish, Arabic, and Chinese)
plus a group of native-English-speaking graduate students from the United
States. The group of students applying for admission at the graduate level
was to be further subdivided into three major field categories: business,
"hard" science, and social science/humanities. Some subsamples presumably
are of greater interest to the TOEFL program, while other subsamples are of
more interest to the GRE program. The sample that was obtained, however,
contained few business majors; these candidates were included in the social
science/humanities classification, resulting in two major field categories.

Most Chinese TOEFL candidates are from Taiwan, but few undergraduates
are tested in Taiwan. Large numbers of Chinese candidates for
admission as undergraduates come from Hong Kong. Thus, we
anticipated that Chinese graduate candidates would be drawn from
test sites in Taiwan, and undergraduates from Hong Kong. This would
provide for the greatest generalizability of the results to the
actual TOEFL population. However, given the known differences
between education in Taiwan and Hong Kong, the confounding of
location with undergraduate status must be considered when the
results are interpreted.

We proposed to test samples of approximately 270 students from each 
language group, 70-75 candidates for admission as undergraduates and 
200-210 candidates for admission as graduate students. More students were 
needed in the graduate category because this group would be divided and 
analyzed separately according to academic majors; undergraduate students 
would be treated as a single group for analysis purposes. As described in
the chapter on data collection, the actual total sample (662) obtained was 
smaller than anticipated; however, the sizes of the total sample and native 
language subgroups were sufficient for the statistical analyses. The 
sample sizes varied, depending on the amount of missing and complete data, 
for each of the several analyses. The detailed descriptions of the
resulting data collection and analyses are reported in subsequent chapters. 

The GRE/TOEFL group included candidates for admission as graduate
students who had taken (or planned to take) both the TOEFL and GRE
examinations. The TOEFL-only group included foreign candidates for
admission to institutions in the United States as undergraduate or graduate
students. The GRE-only group included native-English-speaking candidates
who were candidates for graduate admissions to institutions in the United
States. GRE scores were obtained for native-English-speaking candidates,
whereas both TOEFL and GRE scores were obtained for candidates for
admissions in both the domestic and foreign samples who had taken the TOEFL
as well as the GRE. Native students responded to the writing assessment
instruments at universities in the United States; nonnative students
responded on the day on which they took the TOEFL at international test
centers.

Testing procedures 

Because language skills can change dramatically in a relatively short
period of time, testing students in the United States some months after
they took the TOEFL in their native countries might lead to inexplicable
confounding and uninterpretable results. Instead, we tested students at
foreign centers as close in time as possible to when they took the TOEFL.
GRE scores should be less subject to short-term fluctuations, and any
student who had taken the GRE up to six months before the TOEFL or who was
scheduled to take the GRE up to six months after the TOEFL was eligible for
inclusion in the sample.

The international centers were selected, with the assistance of program 
staff, based on the following criteria: having candidates from the desired 
language groups, having candidates representing diverse ability levels, 
having a reasonable balance of undergraduate and graduate candidates, and
having substantial numbers of GRE (recent past or potential) candidates.
The procedures for selecting and inviting the candidates varied, depending 
on the specific conditions at each test site, as described in Chapter III. 

Writing samples from GRE candidates in the domestic sample were
collected during special testing sessions at five major university testing
centers after we had identified and selected recent GRE General Test
takers. Since the GRE General Test does not contain an indirect measure of
writing skills, the GRE candidates at domestic sites also took a brief
objective test of writing skills, a retired form of a test of writing
skills formerly used by the Law School Admission Testing program. Thus we
were able to compare indirect measures with direct measures of writing for
the native GRE candidates, as well as for the nonnative TOEFL and GRE
candidates.

As proposed, each native and nonnative English-speaking candidate 
produced four writing samples, two samples per topic type. We collected 
this number of samples in order to elicit a reasonable representation of 
writing skills, as well as an indication of the degree of consistency in 
the performance of individuals across similar and different tasks. We 
recognized that the samples ideally should be obtained at more than one 
sitting, to avoid fatigue and uniform responses that students might show
because tasks are consecutive (Diederich, French, & Carlton, 1961; Godshalk
et al., 1966). The one-day testing situation was unavoidable, however,
because of the logistics (and subsequent attrition) involved in asking 
students at the international testing centers to return on another day. 
Thus the writing sample topics were designed to be sufficiently different 
to discourage mechanical responding. The distinctly different topic types 
also were expected to elicit different writing skills, particularly since 
the task requirements were to be carefully phrased to emphasize their 
different expectations. 

Scoring of the Instruments for the Direct Assessment of Writing 
Scoring methods 

Selection of an appropriate scoring method for a writing sample depends
on the purposes of the assessment. A holistic evaluation (i.e., a single
score representing the overall impression created by the sample) may be
more efficient for making selection or placement decisions, whereas a more
analytic framework (i.e., separate scores for a number of different
organizational and grammatical features of the sample) may be more useful
for providing diagnostic information to teachers. Although other methods
(e.g., error counts) may yield more objective scores as a rough index of
second language proficiency, they may be poor indicators of functional
communicative competence.

Holistic scoring is impressionistic, but it is not haphazard. 
Considerable care must go into selecting sample essays (range finders) that 
represent each point on the score scale, and thorough training of the 
readers is necessary. Such training involves discussion among the readers
to reach consensus on the criteria. During a reading session, continual
checks must be made to ensure that no reader is straying from the standards
originally set. Since the scorer judgments are subjective, each essay
should receive at least two independent readings. The scores from the two
readers are typically added together to form the single holistic score.
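
The combination of two independent readings described above can be
sketched in a few lines of Python. This is only an illustration: the
function name, the result fields, and the discrepancy threshold are
assumptions introduced here, since the text states only that the two
scores are added.

```python
from dataclasses import dataclass

# Hypothetical threshold for flagging disagreement; the report does not give one.
MAX_GAP = 2


@dataclass
class HolisticResult:
    total: int          # sum of the two independent readings
    needs_review: bool  # True when the readings differ by more than the threshold


def combine_readings(first: int, second: int, max_gap: int = MAX_GAP) -> HolisticResult:
    """Add two independent readings to form a single holistic score."""
    return HolisticResult(
        total=first + second,
        needs_review=abs(first - second) > max_gap,
    )


print(combine_readings(4, 5))
print(combine_readings(2, 6))
```

Flagging large gaps is one possible way to support the "continual checks"
mentioned above; how a flagged paper would then be handled is left open.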

Holistic evaluations may be influenced by a number of features of an
essay, including content, organization, sentence structure, and mechanics.
A study by Freedman (1979), in which essays were rewritten so that they
exhibited strengths or weaknesses on each of the preceding four traits,
indicated that content and organization had the greatest influence on
holistic scores. Mechanics and sentence structure influenced scores only
if the essay was well organized. However, generalizing from studies based
on essays written by native speakers to essays written by ESL students may
be unwarranted. Breland and Jones (1982) used a set of 20 scores
classified as discourse, syntactic, or lexicographic characteristics to
predict holistic scores that had been independently assigned. Paralleling
the findings of Freedman, they found that the discourse characteristics
were the best predictors of holistic scores. However, unlike Freedman,
Breland and Jones included a group of essays written by Hispanic ESL
students. In this group, syntactic and lexicographic scores were
relatively much more important. Subject-verb agreement and range of
vocabulary were particularly strong correlates of holistic scores in the
Hispanic group. This finding may simply reflect the greater range of
syntactic and lexicographic skill found in an ESL population. Regardless
of the reason for the differences in the ESL group, this study serves as a
useful reminder that even well-established "facts" concerning the scoring
of writing samples may have to be modified for ESL populations.

In native speakers, for example, organizational skills usually parallel 
mechanical skills, and it is unusual to find highly organized essays 
written by students with very poor grammatical skills. With students for 
whom English is a second language, a greater disparity between 
organizational skills and mechanical competence in English would not be 
unreasonable to expect. 

If a single holistic score is to be used, the raters must agree on how
to score essays that present a large discrepancy between organizational and
mechanical skill. They must also agree on which mechanical errors are most
serious. This judgment of error gravity may stem from a strictly
functional communication point of view (Does this error interfere with what
the author is trying to say?), or it also may penalize errors that are
stylistically undesirable (e.g., redundancy, run-on sentences). In
addition, raters must agree on how to evaluate essays that contain complex
sentence structures, and in which the writers make errors in trying to
write complex sentences, versus essays that use only simple sentences but
contain few errors. In her research, Greenberg (1983) noted that ability
to avoid errors predicted teachers' quality ratings better than the
writer's ability to handle complex syntactic structures. She found that
one major problem consisted of word form errors. Shaughnessy (1977), in
fact, recognized that word form errors exemplify "advanced errors." Such
errors indicate attempts to acquire formal academic vocabulary in spite of
the risk of making errors. Thus more competent writers may commit more
errors, yet may be penalized by raters who focus on the lack of errors as a
predominant feature of good writing. During the training for holistic
scoring, discussion about errors should be limited so as not to interfere
with the process of reading for total impression, and to ensure that
particular features of writing do not unduly influence that total
impression.

Despite the most rigorous procedures in the training of scorers,
holistic scoring schemes inevitably require some degree of subjective
judgment, and these subjective judgments may be particularly difficult when
the writer and reader (scorer) do not share a common set of cultural
conventions and expectations. These conventions go far beyond mere
differences in grammatical rules. The work of Kaplan (1966) clearly
demonstrated cultural differences in patterns of logic used to order ideas
within paragraphs. For example, Kaplan suggests that Anglo-European
expository essays typically follow a linear development. In contrast,
paragraph development in Semitic languages is based on a complex series of
parallel constructions of coordinate rather than subordinate clauses.
Oriental essays use an indirect approach; the reader is told how things are
not, rather than how they are. In French and Spanish essays, Kaplan noted
more digression and introduction of extraneous material than would be
considered acceptable in an English essay. Thompson-Panos and Thomas-Ruzic
(1983) recently noted certain contrasting features of English and written
Arabic that may contribute to perceived weaknesses in the writing of Arab
ESL students. For example, paragraph development in Arabic languages
consists of a series of parallel constructions connected by coordinating
conjunctions, thus deemphasizing the use of subordination that is valued in
English paragraph organization.

ESL teachers who are aware of distinct cultural patterns may assign
essay ratings that differ significantly from ratings of English teachers
with no ESL experience. On the other hand, if the criterion for competence
is success in a standard course in a United States university, the
"insensitive" ratings may better predict academic performance than the
culturally sensitive ratings. In this study, we compared ratings by ESL
readers with ratings by readers whose predominant experience is with native
speakers of English. In addition, these ratings were compared to ratings
given by faculty members in engineering and the social sciences. The
classic research of Diederich et al. (1961) suggests that, even among
native speakers, different "schools of thought" exist among readers, and
that certain professions are more likely to emphasize a particular
characteristic. For example, lawyers appear to focus more on organization,
whereas editors tend to focus on style and wording. In our research, the
essay readers completed a questionnaire intended to identify the features
they attend to when evaluating a composition.

Because analytic scoring yields more scores than holistic scoring, it
is potentially more valuable for prescribing educational interventions for
individual students. One scoring scheme that has been used extensively
with ESL students provides separate scores for content, organization,
vocabulary, language usage, and mechanics (Jacobs, Zinkgraf, Wormuth,
Hartfiel, & Hughey, 1981). Other analytic scoring schemes provide for even
finer-grained analyses. However, the apparent advantage of several
separate scores is frequently an illusion; the reader's general impression
is likely to influence ratings on each of the "separate" aspects being
evaluated. In addition, analytic ratings are very time consuming. Wiseman
(1949) found that four general impression markings were equivalent in time
and effort to one analytic marking. As noted previously, despite
considerations of efficiency, a single holistic score may not adequately
describe an ESL student with discrepant organizational and mechanical
skills. Further research is needed to determine the best compromise between
a single score and a complex analytic scoring scheme, as well as which
kinds of scores are more appropriate to specific situational contexts.

The most promising means for the objective scoring of essays may be by
computer software such as Bell Labs' "Writer's Workbench" (Cherry, Fox,
Frase, Gingrich, Keenan, & Macdonald, 1983; Kiefer & Smith, 1983). This
sophisticated word processing tool can identify such features as spelling
errors, overuse of a particular word, and sentences that are consistently
too long or too short. Analysis of these structural features might help
some writers improve their writing. However, this kind of computer program
cannot judge how well a piece of writing accomplishes its main purpose of
communicating with its intended audience, nor can it evaluate features such
as development and organization. The subjective impression of coherence
that a reader "receives" from the written communication cannot be
duplicated by a mechanical count of cohesive elements (Carrell, 1982).

In this study, essays were scored holistically using scoring techniques
developed at ETS (Godshalk et al., 1966) and refined over the years as a
standard procedure in several ETS testing programs. In holistic scoring,
judgments are made about qualities of the essay as a whole rather than by
obtaining numerical counts of specific features. But holistic scoring does
not imply that only one global score may be assigned to each essay.
Several different characteristics of an essay may be evaluated
holistically. The faculty members responding to our survey of academic
writing skills indicated that most departments would like to see more than
one score assigned to each essay. Therefore, we also planned to generate
three holistic scores for each essay: one for content and quality of
ideas, one for grammatical and mechanical errors, and one for organization
and coherence. Subsequent to the proposal, Breland and Jones' (1982)
research, alluded to in a previous section of this chapter, suggested that
no more than two scores could be assigned independently to writing samples,
as did consultation with other experts in holistic scoring procedures. The
scoring procedures we adopted are discussed in the chapter devoted to
scoring the direct measures.

In addition to the holistic scores, we also planned to obtain simple
analytic scores (e.g., total essay length, average sentence length,
subject-verb agreement). However, instead of obtaining analytic scores for
papers using human judges, a representative subsample of papers written in
response to each topic was analyzed with Bell Laboratories' Writer's
Workbench software at Colorado State University. Joy Reid, an ESL
faculty member and researcher, supervised these analyses, while Roberta
Scott, a composition instructor, keyed the papers into the computer. A
complete description of this procedure appears in a subsequent chapter.
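
To make the idea of "simple analytic scores" concrete, the sketch below
computes two of the counts named above, total essay length and average
sentence length. It is not the Writer's Workbench procedure, whose
internals are not described here; the function name and the naive
sentence-splitting rule are assumptions made for illustration only.

```python
import re


def simple_counts(essay: str) -> dict:
    """Return total words, sentence count, and mean words per sentence."""
    # Naive rule (an assumption): a sentence ends at ., !, or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", essay.strip()) if s]
    # Count runs of letters and apostrophes as words.
    words = re.findall(r"[A-Za-z']+", essay)
    n_sent = len(sentences)
    return {
        "words": len(words),
        "sentences": n_sent,
        "mean_sentence_length": len(words) / n_sent if n_sent else 0.0,
    }


sample = "Cities are crowded. Villages are quiet, but jobs are scarce there."
print(simple_counts(sample))
```

Counts of this kind are mechanical by design; a feature such as
subject-verb agreement would need grammatical analysis well beyond this
sketch.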

Scorers 

The scorers included individuals experienced with assessment in ESL and
English composition, including a core of scorers experienced in holistic
scoring.

In order to obtain additional and Independent scores for the writing 
samples, we also obtained ratings for a subsample of papers from faculty 
members from the two academic disciplines with the largest foreign student 
enrollments. They were asked to evaluate the papers from the perspective 
of writing skills exhibited at the time of admission to their program, 
rather than from the perspective of writing skills expected to be developed 
as students develop discipline-specific writing skills. These ratings by 
faculty members made it possible to compare scores assigned by subject 
matter experts with scores assigned by writing experts, providing some 
indication of the extent to which points of view regarding writing
competence reflect different perspectives within these disciplines.

A slight change from proposed procedures was made with regard to the 
rating of papers by subject matter readers because of the dearth of 
potential business majors. Instead of using subject matter experts 
representing business and the hard sciences, four faculty members in each 
of two disciplines, the social sciences and hard sciences, assigned ratings 
to representative samples of papers written in response to two essay 
topics, one of each of the two types. 

Scoring procedures 

The holistic scoring procedures basically were those outlined in the 
proposal. However, some alterations were made in order to ensure the 
quality of the readings. 

Because the TOEFL program is considering the addition of a writing 
sample to their operational testing program, the objectives of the holistic 
scoring were twofold: (1) to obtain valid and reliable scores to contribute 
to the statistical analyses relevant to the research objectives regarding 
the construct validation of GRE and TOEFL scores, and (2) secondarily, to 
provide information about scoring procedures that would be useful to an 
operational writing assessment program. The holistic scoring sessions were 
carefully designed in the light of these objectives. The basic operational 
procedures used at ETS were employed, but additional complexities were 
introduced because a controlled research design also was imposed.

Psychometric and Interpretation Factors



Although psychometric considerations of reliability and validity are
essentially the same for ESL essays as for essays written by native
speakers, the unique cultural and linguistic characteristics of ESL
students require special attention.

Reliability or consistency of essay scores can be assessed in a number
of different ways (intrarater, interrater, across topics within genre,
across genre). Intrarater reliability indicates how consistent a single
rater is in scoring the same set of essays twice with a specified time
interval between the first and second scoring. Interrater reliability
estimates the extent to which two or more raters agree on the score that
should be assigned to an essay. When essay writers and raters represent
different cultural perspectives, interrater reliability is likely to be
lower than when both essay writers and raters come from a homogeneous
group. But even if interrater reliability is perfect, the claim cannot be
made that the essay test is perfectly reliable. Other factors such as
variations over time, from one topic to another, and from one sample of
students to another also must be considered.
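
One common way to express interrater reliability as a number is the
Pearson correlation between two raters' scores for the same essays. The
sketch below takes that approach as an assumption; the text does not say
which index was computed at this point, and the scores shown are invented
for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary for the correlation to be defined")
    return sxy / sqrt(sxx * syy)


# Invented scores for six essays, each read once by two raters.
rater_a = [3, 4, 2, 5, 4, 1]
rater_b = [3, 5, 2, 4, 4, 2]
print(round(pearson(rater_a, rater_b), 2))
```

The same function could serve for intrarater reliability by passing one
rater's first and second scoring of the same essays.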

Intertopic reliability assesses the extent to which the rank ordering
of student scores depends on the topic. Scores will vary from one topic to
another even within the same general topic type (e.g., compare and
contrast). A relatively small intertopic variation in a group representing
a single cultural group may become quite pronounced in a culturally diverse
sample if one of the topics is particularly evocative for students from one
culture. For example, a topic comparing life in a democracy to life in a
dictatorship may represent an abstract academic exercise for North American
students but may stimulate an intense personal reaction from students from
Central America. In addition, variations from one topic type to another
(e.g., narrative vs. persuasive) may be even more influenced by cultural
factors.
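
Because intertopic reliability concerns the rank ordering of students,
one way to illustrate it is a rank correlation between the same students'
scores on two topics. The sketch below uses the Spearman coefficient,
computed as the Pearson correlation of the ranks; choosing this index is
an assumption made here, not a description of the analyses in this report.

```python
from math import sqrt


def ranks(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every value tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation between scores on two topics."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary for the correlation to be defined")
    return sxy / sqrt(sxx * syy)


# Invented holistic scores for five students on two topics.
topic_1 = [8, 6, 10, 4, 7]
topic_2 = [7, 6, 9, 5, 9]
print(round(spearman(topic_1, topic_2), 2))
```

A value near 1 would mean the topics order students in much the same
way; a lower value would suggest that standing depends on the topic.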

High reliability does not provide sufficient evidence that a test is
valid. Instead, the test may be measuring consistently a variable that is
not the criterion of primary interest. Thus, a 30-minute writing sample
might be judged reliable, but it might not serve as a valid indicator of
the student's ability to write a long paper without limitations on time and
with an opportunity to make extensive initial drafts.

As Cronbach (1971) has noted, it is not tests that are validated but
rather interpretations of data from tests used in specific contexts.
Scores from an essay test may be valid for one purpose but not another.
For instance, a test that serves as a valid indicator of skill in writing a
narrative essay may have little value in predicting a student's ability to
meet the writing demands in a graduate engineering program. Furthermore, a
test that is considered a valid predictor of success in meeting the writing
demands of undergraduate study for native speakers may or may not predict
with comparable validity for ESL students.

Optimally, validity should be determined by establishing that a test is
measuring the same performance objective that a good external criterion
also is measuring. When the parameters that condition a measure of writing
skills are taken into account, the external appearance of a writing sample
topic, or its face validity, is not sufficient to ensure the validity of
the performance that is intended to be measured. An objective means for
determining the validity of scores on a writing sample can be achieved by
correlating these scores with scores on other measures that have been
demonstrated to predict well to the same criterion. This criterion,
likewise, must have evidenced validity and reliability. One frequently used
criterion of academic success, such as the grade point average, may not
meet consistently the constraints of validity and reliability. Instead,
valid and reliable scores on an established test that has been shown to
predict to the criterion (i.e., grades) may serve as a more objective
indicator for validating writing sample scores. The validity of scores for
writing samples that are included in standardized tests, for example, is
established by demonstrating that the scores are highly correlated with
scores on indirect measures of writing ability.

Ideally, however, scores on direct and indirect measures would not be 
perfectly correlated. Because a writing sample requires the production of 
a composition in contrast to the recognition of correct responses on a 
multiple-choice test of writing ability, we would not expect the two types
of test to assess identical skills. Instead, they would be highly
correlated because some of the skills they are measuring overlap and reflect a
form of "general" writing ability. In addition, writing samples would be 
expected to contribute additional information about writing performance 
that is not yielded by an objective test, thus explaining an imperfect 
correlation. 



Test validation is a process of accumulating evidence to support 
inferences made from test scores, reflecting the value of a test for an 
intended purpose; more sources of evidence are better than fewer. For this 
study, we intentionally planned to score the writing samples in various 
ways, and to relate these scores to other measures, in order to obtain as 
much information as possible regarding the validity of direct measures of 
writing in the TOEFL and GRE contexts. The different procedures and 
analyses are discussed in subsequent chapters. 



Data Analyses 

We performed several statistical analyses of the data, consisting of 
correlational and factor analyses. The specific analyses and results 
appear in the final chapters of this report. The data analyses were 
conducted to reveal the degree of relationship among several variables— GRE 
section scores and item type scores, TOEFL total and section scores, scores 
on indirect measures of writing ability (included in the TOEFL and as a 
separate test for native speakers of English), and the different scores 
derived from direct measures of writing ability. In addition, the obtained 
relationships were examined with respect to the different language groups 
(Arabic, Chinese, Spanish, and English). The objective of the data 
analyses was to provide information about the content and construct 
validity of the GRE and TOEFL examinations; in particular, the data would 
suggest the extent to which writing ability contributes to GRE and TOEFL 
test scores. 

II. DEVELOPMENT OF RESEARCH INSTRUMENTS 



In preparation for making comparisons of direct measures of writing 
ability with indirect measures and with TOEFL and GRE scores, writing tasks 
were carefully designed for the assessment of performance on writing 
samples. The procedures used in designing, pretesting, and pilot testing 
the writing tasks are described in the following section of this report. 
For the sample of students for whom English is the primary language, and 
for whom, therefore, only GRE scores were available, a multiple-choice 
section retired from the Law School Admission Test was used to provide the 
indirect measure of writing ability; that test is described in the second 
section of this chapter. For the readers of the student papers, who 
represented two different disciplines — ESL and English composition — a 
questionnaire was developed to survey the readers' general perspectives on 
the evaluation of writing and their reactions to scoring the writing 
samples in this study; this questionnaire is described in the final section 
of this chapter. 



Development of Instruments for the Direct Assessment of Writing 

This process, a critical element of the study, demanded attention to, 
and, to the extent possible, control of the numerous factors that influence 
a direct assessment of writing ability. Besides heeding the many 
considerations that normally influence the design of the writing task, we 
needed, through pretesting and pilot testing of topics, to test our 
assumptions concerning the writing performance that would actually be 
elicited by our particular topics and the tasks they presented. The topics 
then were pretested, resulting in the selection of a reduced number of 
topics with the potential to tap writing performance effectively. 
Furthermore, these topics were pilot tested, and the resulting writing 
samples were scored in an essay reading that focused on the writing 
performance elicited by the topics. Eventually, the final topics that were 
selected for administration to the large sample of international and U.S. 
students were refined and formatted in carefully designed test booklets. 
The detailed descriptions of these procedures are presented in this section 
of the report. 

Development of Writing Tasks 

This effort was based on the information obtained in the survey of 
academic writing tasks summarized in the preceding section. Although our 
survey indicated that no single type of writing sample topic was univer- 
sally accepted by all academic disciplines surveyed in the study, two topic 
types were selected as most representative of the kinds of writing tasks 
that would be useful performance indicators for institutions of higher 
education during the admissions process. As described previously, the 
compare/contrast topic type was selected as an effective contrast to the 
graph/chart topic type; the fact that departments perceived these two types 
of tasks as distinctly different suggests that writing samples elicited by 
these tasks may demonstrate different writing skills as well. 

Several parameters affecting the design of a writing task were taken 
into account while the staff wrote the preliminary set of topics: 

o The content of the topics needed to be equally accessible to 
the variety of students who would be responding to it. Since 
the nonnative speakers of English in the research sample would 
come from different cultural backgrounds, the content implied by 
the topic could not favor a particular set of personal or cultural 
experiences. Subtle biases in the topics were avoided by 
eliminating topics that suggested controversial social norms (e.g., 
family size reflecting family planning), or social conventions 
(e.g., the Dewey Decimal System in a library), or cultural 
perspectives (e.g., assumptions of American middle class views). 

o In addition, the terminology in the writing tasks was to be free 
of vocabulary and concepts that required specialized knowledge 
for an effective response to the topic. Because the primary 
objective of the topic was to stimulate performance representative 
of the student's writing ability, culture-bound terms and concepts 
present artificial obstacles to that performance. Similarly, 
writing tasks that posed a high level of reading difficulty and 
vocabulary were revised or eliminated, since reading ability and 
differential standards of English vocabulary mastery would confound 
the assessment of writing skills. 

o Topics also were designed to diminish the possibility that 
emotional responses would be evoked by the subject matter or by the 
hidden agenda of the task. Topics stimulating a highly personal 
reaction could either create an emotional obstacle deleterious to 
performance or lead to the production of a writing sample in the 
form of a personal essay rather than in the mode of discourse that 
was intended. 

o The subjects of the tasks needed to be sufficiently compelling to 
the writers and, eventually, to the readers of the writing samples. 
Each subject was chosen to be interesting enough to engage the 
writer's interest and promote some latitude in responding, 
providing the writer with a relatively challenging task and the 
reader with a range of performance to be evaluated. 

o From the standpoint of the evaluation of writing samples, 
experience with large-scale testing programs at ETS indicates that 
the most effective prompt for the writing task is one that elicits 
an optimal range of responses. Ideally the range of responses 
should be sufficiently broad to make distinctions, yet not so broad 
that the responses are too divergent to compare on an evaluative 
scale or so narrow that the writers are limited in demonstrating 
their abilities to deal with the assignment. Although a writing 
task may have the appearance of meeting this requirement, its 
success can be verified only by collecting a representative number 
of writing samples and observing the actual performance of writers 
who respond to the task. 

o The length and specificity appropriate to the writing should be 
conveyed to the writer by the topic and accompanying cues. The 
topic should communicate enough about expectations to elicit the 
writing performance that is desirable for evaluation. If the task 
fails to communicate these expectations to the writer, readers will 
have difficulty accommodating their judgments about the writing 
samples that are obtained to their evaluation criteria, and writers 
may be penalized inadvertently for failing to address the task 
appropriately. 

o The mode of discourse or form the written product will take (e.g., 
personal essay, persuasive argument) also is conveyed by the 
writing stimulus. This constraint on the writing task is 
determined by the objectives for evaluating writing performance. 
For example, if the ability to develop a personal essay is to be an 
important objective of a writing assessment, the task should be 
structured to elicit personal writing. However, in this instance 
the previous survey of writing tasks indicated that specific types 
of writing are valued within specific academic contexts. Since the 
goal of this research was to obtain writing samples that elicited 
these forms of writing, the stimulus needed to be designed so that 
student writers would respond with the expected forms. The form 
the written product takes is conveyed not only by the explicit 
instructions to "summarize" or "describe" but also by the content 
of the topic that serves as the vehicle for expressing ideas within 
that format. Thus the writing stimuli need to be written with 
consideration for the kinds of writing that might be anticipated 
when prompted by a particular ideational structure. For example, 
a compare/contrast topic might be so emotionally charged that most 
students would produce a personal essay in response. Although most 
writing tasks actually consist of a combination of modes of 
discourse, the dominant mode that is to be evaluated must be 
emphasized. 

o Many experts in English composition stress that the purpose and 
audience for the writing sample should be specified. Because we 
were attempting to attend to the potential cross-cultural 
differences of the students who would be writing for this study, 
the audience and purpose were not stated specifically. The 
designation of a specific audience and purpose may have introduced 
cultural and experiential bias (e.g., liberal arts candidates vs. 
engineering candidates); audience specification must be explicit, 
but also appropriate. In addition, we made the assumption that the 
TOEFL and GRE General Test candidates are keenly aware of the 
purpose and audience for a writing task that is a part of an 
examination taken for the purpose of demonstrating a level of 
English proficiency for admission to institutions of higher 
education in the United States. Eventually, as we critically 
analyzed writing samples obtained in the pilot testing, we 
determined that audience and purpose needed to be more clearly 
prescribed for the chart/graph topics in order to clarify the 
expectations for this topic; students had responded differently to 
the original prompt: some wrote a descriptive piece, whereas others 
presented interpretations of the data. Refinement of this topic 
resulted in writing samples that more nearly met the task demands 
in the final administration. Our experiences with topic design as 
a result of pretesting and pilot testing are described in more 
detail in a subsequent section. 

o From the measurement standpoint, one writing sample is equivalent 
to a one-item test. In interpreting the results of an assessment, 
serious validity and reliability concerns restrict generalization 
from such a limited sample of performance. Ideally, the assessment 
of writing ability should consist of more than one item. 
Furthermore, before decisions can be made on the basis of this 
performance, we need to be assured that the sample that has been 
obtained within the constraints of a testing situation is 
representative of the individual's writing ability. 

o Similar measurement concerns are raised if comparisons are made 
among scores for students who have taken tests composed of 
different questions. For multiple-choice tests containing multiple 
items, this equating problem can be resolved statistically; 
therefore, for example, a score on one form of the TOEFL or GRE 
examination is directly comparable to a score on another test form. 
In large-scale testing programs at ETS, scores on writing samples 
are equated through the multiple-choice test. For scores on 
different writing samples alone, however, the psychometric 
capability for equating items has not been developed. 

o When students are given the opportunity to write in response to 
different tasks, we cannot be assured that each student has been 
given equal opportunity to demonstrate writing despite the apparent 
comparability of assignments. Numerous uncontrolled variables are 
introduced in this instance, such as the differential effects of 
topics, modes of discourse, and the like. In order not to confound 
the scoring and interpretation of scores on the writing samples 
obtained for this research study, all subjects would be expected to 
write on the same topics. In addition, all subjects would respond 
to more than one writing stimulus, writing on four different topics 
in randomized order to control for order effects, in each of two 
different modes of discourse (compare/contrast and chart/graph). 
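
The report does not describe how the randomized topic order was generated, so the following is only a sketch of one possible scheme. It assumes the four topics eventually selected for the main administration (Recreation, Space, Farming, and Continents); the function name, the student identifiers, and the seed are hypothetical. All 24 possible orderings are shuffled once and then dealt out in turn, so that each ordering is used about equally often.

```python
import itertools
import random

# The four topics chosen for the main administration (two per topic type).
TOPICS = ["Recreation", "Space", "Farming", "Continents"]


def assign_topic_orders(student_ids, seed=0):
    """Give each student one ordering of the four topics.

    The 24 possible orderings are shuffled with a seeded generator and
    then assigned cyclically, which spreads order effects evenly.
    """
    rng = random.Random(seed)
    orders = list(itertools.permutations(TOPICS))
    rng.shuffle(orders)
    return {
        student: orders[i % len(orders)]
        for i, student in enumerate(student_ids)
    }


# Hypothetical identifiers for 48 students.
assignments = assign_topic_orders([f"S{n:03d}" for n in range(1, 49)])
print(assignments["S001"])
```

With 48 students, each of the 24 orderings is assigned exactly twice. Purely independent random draws per student would also control for order effects on average, but would not guarantee this balance in a small sample.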

Pretesting of Writing Tasks 



Twenty-three students in English Language Institute classes at the 
University of Delaware responded to a survey to obtain their reactions to 
22 essay topics (10 chart/graph, 12 compare/contrast). The students 
represented a variety of language backgrounds and major fields of study at 
the university. They were given numbered examples of the 22 topics and 
asked to assign two ratings to each topic: (1) how difficult it would be to 
write an essay on the topic (1-5 range, 1 as difficult, 5 as easy), (2) the 
reason or reasons the topic might be difficult to write about (choices of 
grammar, ideas, vocabulary). They also were asked to write the number of 
the chart/graph topic and the number of the compare/contrast topic they 
would most like to write about. In addition, the research staff met with 
the ESL instructors to obtain their reactions to the topics and their 
suggestions for revisions or additional topics. The instructors supplied 
valuable insights regarding the different cultural perspectives of their 
international students and the design of writing tasks that would be the 
most appropriate to the objectives of the study. The international 
students at the University of Delaware reacted to the following topics: 

Chart/Graph 

1. Individual consumption of major foods in the U.S. (line graph) 

2. Factors in the choice of a graduate or professional school (bar 
graph) 

3. Planned fields of study of college seniors (pie chart) 

4. Changes in automobile part production by three companies (bar 
graph) 

5. Expenses for one family (pie chart) 

6. Area and population of continents (bar graph) 

7. Average height of boys and girls from birth to age 20 (line graph) 

8. Changes in farming in the U.S.: 1940-1980 (bar graph) 

9. Area and population of continents (two pie charts) 

10. Factors in the choice of vocational field (bar graph) 

Compare/Contrast 

11. Travel and reading are two ways of learning about people and the 
world. 

12. Food as a necessity vs. food as a source of beauty and pleasure 

13. Potential and limitations of organizations in promoting 
international relations 

14. Methods of decision making: careful thinking vs. quick decisions 

15. Deciding between a job that pays well, but offers little 
enjoyment, and a job that pays less but is very satisfying 

16. Advantages and disadvantages of exploration of outer space 

17. Occupational preferences for working with other people vs. working 
by oneself 

18. Advantages and disadvantages of using chemicals to control insects 

19. Advantages and disadvantages of a common international language 

20. Preference for spending free time in active, physical recreation 
vs. participation in intellectual activities 

21. Advantages and disadvantages of the automobile 

22. Advantages and disadvantages of very large vs. small universities/ 
colleges 

The chart/graph topics were purposely designed to present data in 
different forms (bar graphs, line graphs, and pie charts) in order to 
explore possible differential reactions to these stimuli. In fact, the ESL 
instructors sense that students have varied degrees of experience with data 
presentations; for example, students from the Middle East seem to be more 
comfortable with tabulated data than with graphs and charts. Topic 9, in 
fact, presents tabulated material juxtaposed with the pie charts for this 
reason. The student and instructor reactions thus guided the selection of 
chart/graph topics that would not create a problem in understanding the 
stimulus. 

Eight topics, four chart/graph and four compare/contrast, were selected 
on the following basis: students perceived them to be of an average range 
of difficulty (2-4); students reported fewer reasons, particularly in 
regard to ideas or vocabulary, for having difficulty with the topics. 
Their selection of topics they would most like to write about also was 
considered in eliminating less promising topics. The students' preferences 
for the eight topics selected are summarized as follows: 

Chart/Graph 

1. Individual consumption of major foods in the U.S. (line graph): 
overall difficulty rating of 4; reasons for difficulty: ideas 
(D. Food)* 

4. Changes in automobile part production by three companies (bar 
graph): overall difficulty rating of 3 or 5; reasons for 
difficulty: ideas (C. Automobile) 

8. Changes in farming in the U.S.: 1940-1980 (bar graph): overall 
difficulty rating of 2 to 4; reasons for difficulty: particularly 
ideas and vocabulary (B. Farming) 

9. Area and population of continents (two pie charts): overall 
difficulty rating of 3, with a few 4s and 3s; reasons for 
difficulty: no perceptions of reasons for difficulty (A. 
Continents) 

Compare/Contrast 

11. Travel and reading are two ways of learning about people and the 
world: overall difficulty rating of 4; essentially no perceptions 
of reasons for difficulty, though a few selected grammar 
(3. Learning) 

16. Advantages and disadvantages of exploration of outer space: overall 
difficulty rating of 3, with a few perceptions of difficulty with 
ideas and vocabulary (4. Space) 

18. Advantages and disadvantages of using chemicals to control insects: 
overall difficulty rating of 2 and 3, with a few perceptions of 
difficulty with ideas and vocabulary (2. Chemicals) 

20. Preference for spending free time in active, physical recreation 
vs. participation in intellectual activities: overall difficulty 
rating of 2 to 4, with a few perceptions of difficulty with ideas 
(1. Recreation) 

Pilot Testing of the Eight Topics 

Colleagues who are involved in ESL instruction offered to assist with 
the pilot testing of the writing sample topics. Individuals at seven 
different institutions of higher education throughout the United States 
administered the writing prompts during regularly scheduled class periods 
in August and September of 1983. 

The samples were administered primarily to students who were preparing 
to enter the institutions in the fall, both at undergraduate and graduate 
levels of education. At some schools, all eight topics were administered, 
whereas other schools selected particular topics they wished to use. Some 
students wrote on more than one topic, but most of the writing samples 
obtained were written by different students. The individuals who 
administered the writing prompts were instructed to attempt to obtain the 
samples under standardized testing conditions, giving students 30 minutes 
to write on each topic. 

*A brief "title" for each of the topics is included in parentheses, to 
simplify discussion about these topics later in the report. The 
letters or numbers were assigned to the topics for the pilot test 
essay reading. 

A total of 447 writing samples were obtained; an additional 30 that had 
been collected by one institution could not be used because they arrived 
too late to be included in the reading session. The numbers of writing 
samples obtained for the chart/graph topics were as follows: 46 samples 
for the topic labeled Continents (A); 56, Farming (B); 33, Automobile (C); 
and 52, Food (D). For the compare/contrast topics, the numbers of writing 
samples obtained were as follows: 42 samples for the topic labeled 
Recreation (1); 53, Chemicals (2); 100, Learning (3); and 65, Space (4). 
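
As a quick consistency check on these counts, the short sketch below sums the per-topic figures reported above; only the variable names are introduced here. The eight counts do add up to the stated total of 447 samples.

```python
# Per-topic counts of pilot test writing samples, as reported in the text.
chart_graph = {
    "Continents (A)": 46,
    "Farming (B)": 56,
    "Automobile (C)": 33,
    "Food (D)": 52,
}
compare_contrast = {
    "Recreation (1)": 42,
    "Chemicals (2)": 53,
    "Learning (3)": 100,
    "Space (4)": 65,
}

chart_graph_total = sum(chart_graph.values())            # 187
compare_contrast_total = sum(compare_contrast.values())  # 260
grand_total = chart_graph_total + compare_contrast_total  # 447

print(chart_graph_total, compare_contrast_total, grand_total)
```

The uneven counts (33 for Automobile against 100 for Learning) follow from the fact that some schools chose which topics to administer.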

Reading of Pilot Test Writing Samples 

Prior to the reading of the pilot test writing samples, four project 
staff members, including the project directors and two ETS experts on 
scoring writing samples, read the papers to select examples to be used as 
range finders during the reading of the entire set of 447 samples. This 
sample picking required the staff members to read, and score independently, 
nearly the entire set of writing samples, a process that required two full 
days of reading. The objective was to obtain a set of writing samples that 
illustrated the full range of writing ability, demonstrated characteristic 
problems in scoring, and called attention to typical reader pitfalls (e.g., 
assigning a low score to a short paper). The samples selected were those 
for which two staff members agreed exactly on the score. A third person 
read and scored unusual papers. The reading of the writing samples was 
done holistically to obtain one score to reflect the overall impression 
of the quality of each paper. On the advice of Gertrude Conlan, an ETS 
staff member with considerable experience in scoring writing samples, the 
project staff decided to try out a six-point score scale. As the papers 
were read, they agreed that the six-point scale worked very well; a 
four-point scale would not have discriminated well among the set of papers 
obtained, and a more extended scale appeared to require more distinctions 
than could have been made with confidence. 

Since the numbers of writing samples obtained for each of the eight 
topics were not large enough to justify separate training for each topic, 
the staff decided that the four graph/chart topics would be read together, 
as would the four compare/contrast topics. At the conclusion of the staff 
readings of the writing samples, two sets of range finders were selected to 
use in the training of readers for the pilot test samples. For the 
compare/contrast topics, 25 writing samples were selected; for the 
chart/graph topics, 24. These two sets of range finders were duplicated and 
placed in random order. For easy reference during reader training, each 
range finder was labeled alphabetically to correspond to this order and 
also labeled either alphabetically or numerically to correspond to the 
specific topic to which the student had responded in each writing sample. 
For example, the first writing sample in the set of compare/contrast range 
finders was labeled A, which was followed by the number 1 to designate that 
the topic was topic 1, Recreation. The range finders then were duplicated 
so that each reader would have copies. The covers had been removed from 
all papers previously to avoid influencing the readers with information 
about the writer's nationality or major field of study. 

In preparation for the reading of the pilot test writing samples, it 
was determined that, at a rate of 35 papers per reader per hour with two 
readings per paper, six readers could read the 447 papers. This estimate 
allowed for the possibility that ESL papers would take slightly longer to 
read than papers written by native speakers of English. It also included 
time for training, for introducing each new topic, and for discussion about 
each topic. The staff decided to conduct two reader training periods, one 
using the mixture of range finders on compare/contrast topics and one 
using the mixture of range finders on the chart/graph topics. 
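
The arithmetic behind this staffing estimate can be sketched as follows. The figures come from the paragraph above; the function name is hypothetical, and the calculation covers reading time only, not the training and discussion time that the staff also budgeted for.

```python
def reading_hours(papers, readings_per_paper, readers, papers_per_reader_hour):
    """Hours of pure reading time needed to complete all readings."""
    total_readings = papers * readings_per_paper
    readings_per_hour = readers * papers_per_reader_hour
    return total_readings / readings_per_hour


# 447 papers, each read twice, by six readers at 35 papers per reader per hour.
hours = reading_hours(papers=447, readings_per_paper=2,
                      readers=6, papers_per_reader_hour=35)
print(f"{hours:.1f} hours of reading")  # 894 readings / 210 per hour, about 4.3
```

At roughly four and a quarter hours of reading, a one-day session leaves only a few hours for two training periods and topic discussions, so any overrun in those activities cuts directly into reading time.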

The reading was conducted on September 7, 1983. Each reader received a 
folder containing copies of the topics and the two sets of sample papers. 
The readers were instructed that the major objectives were to read the 
papers with the perspective of selecting the topics that "worked" best 
(those that showed evidence of a broad range of writing ability, that 
elicited the kind of writing intended, and that allowed readers to make 
clear distinctions in assigning scores) and to evaluate the efficacy of the 
six-point scale. Their objective at the conclusion of the day was to 
determine which two of the compare/contrast topics and which two of the 
chart/graph topics would be used for the large test administration. At the 
conclusion of the introductory remarks, which also outlined the schedule 
for the day, the chief reader began the training period. 

The training of readers 

The first part of the day was devoted to the compare/contrast topics. 
The chief reader asked the readers to read, score, and rank order a set of 
range finders, eight papers that she had selected from the samples for this 
topic type. Although the range finders were drawn from all four topics 
of each topic type, the readers scored all of the papers on one topic at a 
time. In reading the papers, the readers were instructed to read each 
paper quickly from beginning to end, to obtain an overall impression, and 
then to score the paper on a scale of one to six, with a score of one for a 
poorly written paper that minimally addressed the topic and a score of six 
for a top paper within the group. The readers were cautioned to avoid 
being inappropriately influenced by the following features of the writing 
samples: neatness, handwriting, occasional spelling or plurality errors, a 
paper with a good introduction that gradually deteriorates in quality, and 
a paper with a good closing statement that may or may not make up for 
previous weaknesses. 

When the readers had finished these papers, their scores were tallied 
on a chalkboard. After discussion, six more papers were introduced, read, 
and discussed; then a few more were used to assist the readers in refining 
their definition of papers that should receive the midpoint scores of three 
and four. The training took approximately one and one half hours. The 
afternoon training, which observed similar procedures for the chart/graph 
topics, was completed in approximately one hour. 

The scoring of papers 

Although sufficient time had been allotted for reading the pilot test 
writing samples, it became clear during the morning of the reading that not 
all papers would be read. Papers on two compare/contrast topics were read 
in the morning, and the remaining two compare/contrast papers and four 
chart/graph papers in the afternoon. Because of time constraints and the 
high interreader agreement on the compare/contrast topic papers, the papers 
on the chart/graph topics were read by only one reader. The project staff 
conducted spot check readings throughout the day to ensure that the readers 
were scoring accurately and reliably. To resolve the very few 
discrepancies in scoring, one staff member read these papers a third time. 

Throughout the reading, two staff members distributed the papers to 
readers and collected them as they were read. The aide recorded the reader 
number and score, covered the first score with a black sticker, and sorted 
the papers so that each paper would be read for the second time by a 
different reader. After recording the second score, the aide gave those 
papers requiring a third reading to the designated staff member. 
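
The routing of papers through two readings and an occasional third can be sketched as below. The report does not state what size of difference counted as a discrepancy, so the threshold used here (scores more than one point apart) is an assumption, as are the function names and the example scores.

```python
def needs_third_reading(first, second, max_gap=1):
    """True when two holistic scores differ by more than max_gap points.

    max_gap=1 is an assumed threshold, not one given in the report.
    """
    return abs(first - second) > max_gap


def summarize_readings(score_pairs, max_gap=1):
    """Count exact and adjacent agreement and list papers for a third reading."""
    exact = sum(1 for a, b in score_pairs.values() if a == b)
    adjacent = sum(1 for a, b in score_pairs.values() if abs(a - b) == 1)
    third = sorted(
        paper for paper, (a, b) in score_pairs.items()
        if needs_third_reading(a, b, max_gap)
    )
    return {"exact": exact, "adjacent": adjacent, "third_reading": third}


# Hypothetical first and second scores on the six-point scale.
pairs = {"P01": (4, 4), "P02": (3, 4), "P03": (2, 5), "P04": (6, 6), "P05": (1, 2)}
summary = summarize_readings(pairs)
print(summary)
```

Covering the first score before the second reading, as the aide did with a black sticker, keeps the two scores independent, which is what makes agreement counts of this kind meaningful.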

After the reading of the papers for each topic concluded, the two 
project directors conducted discussions about the merits of each topic. 
The readers evaluated a topic in relation to the others of its type and 
suggested specific revisions. At the end of the day, the readers 
contributed final recommendations for the four topics that appeared to 
yield writing performance that met the objectives for the research and met 
the criteria for effective topics. 

As the readers commented on the scoring, they referred to the papers 
used in the training, and concentrated on the following evaluation 
criteria, with a focus on the efficacy of the topics: 

o Did the writers understand the topic? 

o Did the writers address the topic directly, and did they follow 
instructions? 

o Were the papers written in response to the topics appropriately 
varied in approach, or do they show the topic to be too broad or 
too constraining? 

o Were the writers able to conclude their papers effectively? 

o Did the chart/graph form of presentation present readability 
problems? 

Conclusions and Implications for the January reading 

This reading session served as an opportunity for a "dry run" in 
preparation for the final reading of the large sample of papers in January. 
This experience provided concrete information about the final design of the 
test booklets to facilitate scoring and the mechanics of the essay reading 
process. The project staff decided, as a result of this experience, that 
the January reading would require the services of the Essay Reading Office 
staff at ETS to organize and run the reading procedures. Although the aide 
for this reading kept the papers flowing well, the need for an adequate 
number of aides was clearly evidenced during this reading. 

With regard to reading rate, it was concluded that the original 
estimate of 35 papers per hour per reader was very accurate; the time pressure 
experienced during this reading resulted from the extended discussion that 
was necessary as the topics were thoroughly examined. The number of 
readers needed was estimated accurately, as well. If some method other 
than holistic scoring were to be used, the estimated reading rate would 
need to be adjusted. 

The training time, especially in the morning, was longer than 
anticipated. Since different methods of scoring were planned for January, it 
was agreed that the training would be more complicated. Because thorough 
training promotes reliable reading, sufficient training time should be 
built into the schedule. In addition, planning would need to anticipate 
stretch breaks for the readers and the availability of additional sample 
papers to keep readers on target during the course of the reading. 

The holistic scoring method worked very well. At this point, the staff 
anticipated the need to plan for the possibility of a two-score reading to 
contrast mechanics/grammar and organization/coherence features without 
sacrificing reliability and efficiency. The readers agreed that this 
scoring system would be worth attempting as an addition to holistic 
scoring. 

The readers and ETS participants agreed that the six-point scale worked 
very well on these papers, because sufficiently fine discriminations could 
be made among the papers. The six-point scale may prove useful in the 
future for making judgments if the essay component becomes operational in 
the TOEFL. 

Final Selection of Topics 

Two compare/contrast topics, Recreation and Space, were selected for 
the large-scale pretest administration at the conclusion of the discussion 
of the four topics of this type. 

The two chart/graph topics, Farming and Continents, were selected by 
the readers as a result of their analyses of the four chart/graph topics. 

Formatting of the Test Booklet 

The physical layout of the writing stimulus requires careful 
consideration because the format in which the writing assignment is presented 
provides cues to the writer that will influence his or her response to the 
topic. Besides the instructions, other nonverbal cues can affect 
performance. The pretesting and pilot testing experiences influenced the 
decisions about design of the test booklet. 

The cover of the booklet clearly indicates the total time allocated to
writing, as well as the time limits for each topic. The thirty-minute time
limit proved to be adequate for most students who wrote responses to the
pilot test topics. Clearly this amount of time does not allow for much in
the way of prewriting activities or revision. A longer time limit may be
desirable, but we could not expect international students to spend more
than two hours writing, in addition to the time taken for administrative
procedures and rest breaks, particularly since each student would be
writing four papers. In the event that a direct measure of writing ability
becomes a section of the TOEFL or GRE General Test, score users will need
to be informed that scores for the writing samples should be interpreted in
the context of restricted time limits and testing conditions. A score on a
writing sample administered in a testing situation cannot be assumed to
represent accurately how the student would perform under optimal conditions
for writing. However, since this study's research conditions should
parallel the conditions of an actual testing administration, time limits
that realistically suited this purpose were used.

The booklet cover also requests information for identifying the student
as a research subject: name, TOEFL application number, native country,
major field of study, number of years studying English, level of education,
and sex. The general instructions on the cover are designed to communicate
expectations regarding administration procedures; the objective of the
assessment ("how well you can write"); the criteria for evaluating the
writing (clear and effective expression of thoughts, emphasis on quality
vs. length); and the physical presentation of the composition (more than
one paragraph, writing on every line, a space for making notes). This
cover was removed prior to the scoring sessions to avoid influencing
readers with background information.

To eliminate the effects of the order in which the essays appeared, the
four essays were assembled in eight different orders in the booklets; thus
different students would be responding to different topics at the same
point in time during the administration. The ordering of the topics was
not entirely random: topics of one type were not presented consecutively,
to minimize the possibility that the writer, having responded to a
compare/contrast topic, would fall into a pattern of responding that would
not be appropriate to the subsequent topic (i.e., the writer would be
responding to the mode of discourse rather than to the new subject matter).
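
The ordering constraint described above can be checked by enumeration. The
following minimal sketch assumes only the four topic names and two topic
types given in this report; it lists every order of the four topics in
which no two topics of the same type appear consecutively. The function
and variable names are illustrative and are not part of the study's
procedures.

```python
from itertools import permutations

# Topic types as described in the report; the dictionary layout is illustrative.
TOPIC_TYPE = {
    "Recreation": "compare/contrast",
    "Space": "compare/contrast",
    "Farming": "chart/graph",
    "Continents": "chart/graph",
}


def alternating_orders(topic_type):
    """Return all topic orders in which adjacent topics differ in type."""
    orders = []
    for order in permutations(topic_type):
        if all(topic_type[a] != topic_type[b] for a, b in zip(order, order[1:])):
            orders.append(order)
    return orders


orders = alternating_orders(TOPIC_TYPE)
for number, order in enumerate(orders, start=1):
    print(number, " -> ".join(order))
```

Under this constraint exactly eight of the 24 possible orders remain, which
is consistent with the eight booklet orders used.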

Each topic also was printed in a different color, which eventually
facilitated record keeping and scoring procedures. These colors frequently
were used in referring to papers written on a specific topic, and were
especially useful when readers were recording their scores on the back of
each test booklet.

The back cover of the test booklet was designed to allow for several
different scoring contingencies. Spaces are allocated for scores assigned
by two readers for each paper, for holistic and two-score methods, and for
a third reading, if warranted. The size of the rectangles in which the
scores would be entered corresponded to the size of the stickers that would
be placed over a score assigned by the first reader of a paper. For a
large-scale essay scoring operation, machine-scannable score sheets can be
designed instead. We did not develop a machine-scannable score sheet,
however, since the initial expenditure is considerable, though a worthwhile
investment for a continuing test program.

Selection of An Indirect Measure of Writing Ability for the GRE Sample 

One of the objectives of this research was to investigate the
relationship of scores on indirect measures of writing ability
(multiple-choice writing tests) with scores on direct measures of writing
ability (writing samples). The sample of subjects in the study was to be
composed of two groups: (1) international students (graduate and
undergraduate) who are nonnative speakers of English and are taking the
TOEFL examination; and (2) United States entry-level graduate students who
are native speakers of English and are taking the GRE General Test, all as
candidates for admission to institutions of higher education in the United
States.

For the international candidates, Section 2 of the TOEFL, Structure and
Written Expression, served as an indirect measure of writing ability. This
section is composed of two parts. The first part contains items that
measure the understanding of basic grammar and syntax; the items are of the
"sentence correction" item type. The second part, consisting of "usage"
items, tests knowledge of the grammar and usage of written English and has
been demonstrated to have a consistently high correlation with writing
ability (Pike, 1979; Pitcher & Ra, 1967).

For the United States candidates, however, the GRE General Test does
not contain an indirect measure of writing ability. Thus, in addition to
responding to the four writing sample research instruments, these
candidates would have to take a separate indirect writing test. The
project staff selected sections of a standardized test previously
administered as a part of the Law School Admission Test (LSAT). These
sections have been discontinued and disclosed, since LSAT candidates now
respond to a writing sample as a direct measure of writing ability. The
LSAT indirect writing test is appropriate to this sample, in that it was
developed for students who have completed undergraduate education and are
candidates at the graduate level. Further, its item types are parallel to
the item types in Section 2 of the TOEFL examination. Section 5 of the old
LSAT was composed of sentence correction items; Section 6, of usage items.

The particular form of the LSAT indirect writing measure was selected
from among the five forms that were administered during the last year it
was in use. In consultation with a test development specialist at ETS who
is familiar with the LSAT, we chose Form 3BLS4 on the basis of the
following criteria: the mean difficulty (delta) for the items is
approximately 12, an appropriate level of difficulty for GRE candidates;
the items represent a good range of deltas; and the form contains only a
few items that have low biserial correlations (at or below .30). In
addition, the specific items in this test form have good face validity and
do not contain content that is culturally biased. The Law School Admission
Council granted permission to the project to use the test form for research
purposes.
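
The selection criteria above can be expressed as a simple screen over item
statistics. The sketch below is illustrative only: the item values are
invented, and the tolerance around the mean delta and the limit on the
number of weak items are assumptions. Only the target mean delta of about
12 and the biserial threshold of .30 come from the text.

```python
from statistics import mean

# Hypothetical (delta, biserial) pairs for a short form; not actual LSAT data.
ITEMS = [(9.5, 0.45), (11.0, 0.52), (12.0, 0.38), (13.5, 0.28), (14.0, 0.41)]


def screen_form(items, target_delta=12.0, tolerance=1.0,
                biserial_floor=0.30, max_weak_items=1):
    """Summarize a form against the difficulty and discrimination criteria.

    tolerance and max_weak_items are assumed values chosen for illustration.
    """
    deltas = [delta for delta, _ in items]
    weak = [item for item in items if item[1] <= biserial_floor]
    mean_delta = mean(deltas)
    return {
        "mean_delta": mean_delta,
        "delta_range": (min(deltas), max(deltas)),
        "weak_items": len(weak),
        "acceptable": (abs(mean_delta - target_delta) <= tolerance
                       and len(weak) <= max_weak_items),
    }


summary = screen_form(ITEMS)
print(summary)
```

A form would pass this hypothetical screen when its mean delta is close to
the target and few of its items fall at or below the biserial floor.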

Since this form was originally contained in a complete LSAT
examination, Sections 5 and 6 were slightly redesigned to create a
nine-page test composed of a total of 60 items. The instructions that had
appeared on this section of the LSAT were essentially used verbatim, with
only minor modifications because the test would not be administered as a
part of a larger test. The same time limits, which had been reasonable
(data indicate that it was not speeded) for LSAT candidates, were retained.

Development of the Essay Reader Questionnaire 

The reliability and validity of methods used for scoring writing
samples are influenced strongly by the readers who apply these methods. As
they evaluate samples of writing, readers are making judgments that are
conditioned not only by training at the time of the essay reading but also
by their personal perspectives with regard to their definitions of "good"
writing and the evaluation of writing ability. One of the objectives of
this research was to determine whether readers who represent different
academic points of view would score the writing samples differently; thus
the large-scale essay reading session was designed to involve equal numbers
of readers with experience in English and ESL instruction and to have each
paper read by readers from both disciplines. Information regarding the
agreement of readers, such as interreader reliability coefficients, would
provide some evidence of agreement among these readers. High reliability
coefficients indicate that, when subjected to training in scoring methods,
different readers assign essentially the same scores to the same paper. If
interreader correlations are moderate or low, the reason(s) for their
disagreement should be investigated.
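
One common index of interreader agreement is the Pearson correlation
between the scores two readers assign to the same papers. The sketch below
is an assumption about how such a coefficient could be computed; this
section does not specify the formula used, and the scores shown are
invented ratings on a six-point scale.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a reader's scores do not vary")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores from one ESL reader and one English reader on eight papers.
esl_reader = [1, 2, 3, 3, 4, 5, 5, 6]
english_reader = [2, 2, 3, 4, 4, 4, 6, 6]

r = pearson_r(esl_reader, english_reader)
print(f"interreader correlation: {r:.2f}")
```

A coefficient near 1 would indicate that the two readers rank the papers in
nearly the same way; it does not by itself show that they assign identical
scores.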

When high interreader reliability coefficients are obtained, this may
have two explanations: (1) the training sessions enabled readers, who may
or may not have common views, to agree on common criteria for evaluation,
and/or (2) despite the training, readers, especially those who are involved
in the field of writing (English or ESL), tend to agree on criteria for
evaluation. With low or moderate interreader reliability coefficients,
other questions are raised with regard to the following: (1) how and if the
training sessions could have been improved to obtain higher agreement; (2)
what readers perceive to be their personal criteria for good writing; and
(3) whether the personal criteria held by readers are significant to the
evaluation of writing and should have been taken into account during the
training. To gain information about readers' points of view, a reader
questionnaire was designed. The staff decided that readers would be asked
to respond to the instrument at the conclusion of each day of essay reading,
rather than prior to the readings, to avoid heightening reader
sensitivities with questionnaire prompts about evaluation criteria.

The reader questionnaires differed slightly on the two reading days
because, at the conclusion of the second day of reading, readers would be
asked to react to and compare the two scoring methods used during the two
sessions. The questionnaire for the first day, Saturday, attempted
to learn about an individual reader's criteria for evaluating writing
skills with three types of items: (1) the features of writing that are
valued in actual practice outside the formal reading session (e.g., in the
classroom, writing center), (2) the features of writing that influenced the
evaluation of writing samples during the essay reading, and (3) reactions
to the scoring system used during that day of essay reading. The writing
features to which readers would be asked to respond consisted of identical
lists of features for the first two questions. This list of features is
nearly identical to the list that was used as a part of the questionnaire
in our previous survey of academic writing skills (Bridgeman & Carlson,
1983), with the addition of one feature, "mastery of the conventions of
grammar." The Sunday questionnaire omitted the features to be evaluated in
question 1, but again asked about the features of writing that influenced
the reader's scoring during the second day of reading and for reactions to
the scoring method used that day (two-score). In the final section of the
Sunday questionnaire, the reader was asked to evaluate the scoring methods
used on the two days. Throughout both questionnaires, the readers were
given an opportunity to supply comments as well.

Therefore, the Saturday and Sunday questionnaires should provide more
detailed information about points of view with regard to the evaluation of
writing ability, such as the following:

o Which features of compositions are most highly valued in judging
the quality of writing?

o Which features are relatively unimportant to judging the quality of
writing?

o Do academics from the different disciplines, ESL and English, place
different emphasis on the features of compositions as they evaluate
them?

o How well do the criteria held by the readers match the explicit or
implicit criteria employed in the reader training sessions?

o Can we assume that the training of readers influenced (reinforced,
altered, or diminished) their personal criteria for the evaluation
of writing?

o Do the criteria used for scoring writing samples during a formal
essay reading have relevance to criteria used in the classroom?

o How can essay scores on a standardized test such as the TOEFL or
GRE examinations be reported meaningfully (a question of
appropriateness or validity)?

III. ADMINISTRATION OF EXPERIMENTAL TESTS



While the topics for the direct measures of writing ability were being
developed, pretested, and pilot tested, arrangements were being made to
administer the experimental instruments at TOEFL test sites and at
institutions in the United States.

International Administration of Writing Samples 

The project staff worked closely with the TOEFL program staff to
identify international TOEFL testing centers with sufficient volumes of
candidates from which to draw the research samples in countries in which
Arabic, Chinese, and Spanish were the native or primary languages. Test
centers in eight countries (two Chinese-speaking, three Arabic-speaking,
and three Spanish-speaking) were selected initially; after preliminary
contacts were made regarding data collection, two additional
Spanish-speaking centers were added.* The TOEFL test center supervisors or
agents at these sites received letters inviting them to participate in the
research study during the late summer and fall of 1983. The letters
contained information briefly describing the project and test
administration procedures; the minimum and maximum numbers of TOEFL
candidates, by levels of education and major fields, that were our sampling
objectives for the site; suggested procedures for identifying or selecting
candidates; a request for the supervisors' recommendations for candidate
remuneration; and the form of reimbursement for their services. All the
test center administrators who were contacted agreed to participate in the
study.

A project objective had been to obtain all writing sample data at the
time of the November TOEFL administration; however, following this
administration, further testing at certain sites was scheduled for the
January TOEFL administration as well. An important requirement for the
data collection was that the administration of the experimental writing
samples take place as close in time as possible, preferably on the same
day as the TOEFL examination. This requirement was imposed so that both
the TOEFL scores and the essay scores would be collected when a candidate
was at a particular level of English proficiency; since English language
proficiency is a developing ability, scores obtained on each measure at
different times would not be comparable because they would be confounded by
intervening experiences with the English language. Most test centers did
administer the writing samples on the afternoon of the TOEFL examination,
following a lunch break; one Arabic center administered them on the
preceding day.



*Hereafter in the text, testing centers with a specific native or
primary language will be referred to as Arabic, Chinese, or Spanish.

Another important objective of the study was to obtain data from a
subsample of TOEFL subjects who had recently taken, or planned to take in
the near future, the GRE General Test. Some centers were able to match
candidates who had registered for the November and January TOEFL
examination with a list of students in their vicinity who had taken, or had
registered for, the GRE; these lists of GRE candidates were either prepared
at ETS and sent to the test center administrator to do the matching, or the
staff at AMIDEAST in Washington, D.C. prepared a list that matched
candidates, or the administrator had the information on GRE candidates in
order to do the matching (particularly if he or she also served as GRE
administrator). This requirement proved to be a difficult one to meet,
since not all international GRE candidates take the TOEFL. The test
centers worked very hard to meet this objective, but were not able to
identify and test as many TOEFL/GRE candidates as had been anticipated.

In addition to meeting the objectives of administering the TOEFL and
collecting the writing samples close together in time and of recruiting
graduate-level TOEFL candidates with GRE scores, the administrators at each
location were asked to meet the following criteria for the selection of
research subjects: minimum/maximum numbers of candidates to be tested,
subjects taking the TOEFL at a point when they are ready to apply or are
prospective candidates for admission to an institution of higher education
in the United States, and subjects whose primary language is that of the
country in which they would be taking the TOEFL. We also recommended that
the administrators invite more candidates than were required, to allow for
attrition, and that the subjects be paid on the same day that they wrote
their writing samples.

The TOEFL test centers that participated in the data collection are as 
follows: 

Arabic 

Cairo, Egypt 
Amman, Jordan
Kuwait, Kuwait 

Chinese 

Kowloon, Hong Kong 
Taipei, Taiwan 

Spanish 

Bogota, Colombia 
Santiago, Chile 
Mexico City, Mexico 
Lima, Peru 
Caracas, Venezuela

As the most appropriate arrangements at each location were worked out,
there were some variations in procedures. The following specifics varied
across sites: the administrator who planned and carried out the testing
(TOEFL agents, supervisors, proctors); procedures for identifying and
contacting candidates; the amount of remuneration in local currency or
dollars that was attractive to candidates; the scheduling of the essay
administration; and the numbers and categories of subjects to be obtained.
Prior to the test administrations, the following materials were mailed to
each center: test booklets, supervisor's instructions, and subject
consent/receipt forms.

United States Administration of Direct and Indirect Measures of
Writing Ability

Data were collected at the following institutions of higher education
in the United States: Rider College, New Jersey; Rutgers University,
New Jersey; Southern Illinois University, Illinois; the University of
California at Los Angeles, California (UCLA); and University of Southern
California, California. The campus representatives experienced
extraordinary difficulties with obtaining subjects; each attempted at least
two administrations, but were unable to obtain as much data as planned.
Candidates were offered $20 each for participating in the study. However,
this stipend was not sufficiently attractive; it appears also that
students in the United States are not very willing to spend their leisure
time writing four papers in a testing situation.



Description of the Sample

We anticipated the possibility that the obtained score patterns would
differ across language groups. The organizational style and nature of
grammatical errors of Chinese-speaking students might be different from
those of Spanish-speaking students, and the relationships of essay scores
with TOEFL and GRE scores also might vary by language group. Indeed,
Pike's (1979) findings indicate the existence of such language group
differences. Therefore, we planned to study three different major language
groups with large numbers of TOEFL and GRE candidates.

In any data collection, some of the data may be unusable for a variety
of reasons, but very few of these cases occurred in this sample (Table
III-1).

In the Arabic language group, a total of 154 tests were administered:
44 in Egypt, 63 in Jordan, and 47 in Kuwait. However, 14 of
these tests were written by students for whom Arabic was not their
native language; these were omitted, resulting in a sample of 140
Arabic-speaking students, composed of 45 undergraduate- and 95
graduate-level candidates.

In the Chinese language group, a total of 232 tests were administered:
69 in Hong Kong (including three graduate-level candidates), and
163 in Taiwan (including 22 undergraduate-level candidates). All 232
tests were usable, resulting in 88 undergraduate- and 144 graduate-level
test booklets.

In the Spanish language group, a total of 216 tests were
administered: 35 in Chile, 12 in Colombia, 42 in Mexico, 117 in Peru,
and 10 in Venezuela. Venezuela and Colombia were unable to recruit as
many subjects as planned. Five tests were unusable in this group, four
because the subjects did not speak Spanish as their primary language.
Thus for the Spanish language group, a total of 211 booklets, composed
of 69 undergraduate- and 142 graduate-level writing samples,
contributed to the analysis.

In the English language group, a total of 60 tests were administered:
2 at Rider College, 3 at Rutgers University, 36 at Southern
Illinois University, 16 at UCLA, and 3 at the University of Southern
California. Five of these booklets were not usable: two because they
were incomplete, and three because the primary language of the writers
was not English. A total of 55 graduate-level booklets for the English
language group resulted. Because administrators had difficulties
obtaining participation, this sample of papers may not be
representative, in that students who felt that they were not capable of
producing four writing samples chose not to participate.
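
The group counts reported above can be cross-checked with simple
arithmetic. This sketch uses only the numbers of tests administered and
excluded that are stated in the preceding paragraphs; the data structure
and names are illustrative.

```python
# (tests administered, booklets excluded) per language group, from the text.
GROUPS = {
    "Arabic": (154, 14),
    "Chinese": (232, 0),
    "Spanish": (216, 5),
    "English": (60, 5),
}

usable = {group: given - excluded for group, (given, excluded) in GROUPS.items()}
total_administered = sum(given for given, _ in GROUPS.values())
total_usable = sum(usable.values())

for group, count in usable.items():
    print(f"{group}: {count} usable booklets")
print(f"administered: {total_administered}, usable: {total_usable}")
```

The resulting totals of 662 booklets administered and 638 usable booklets
agree with the figures reported for the full sample.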

The total number of test booklets collected was 662. Three of these
booklets were incomplete, however; the remaining 659 test booklets were
scored during the essay reading sessions. From this sample, 21
booklets were removed from the analysis because they did not represent
the primary language of the country in which they were collected; the
primary languages of the writers were an assortment of other
international languages. Thus the total number of essay booklets that
were used in the data analysis was 638: 211 by Spanish, 232 by Chinese,
140 by Arabic, and 55 by English candidates. The total number of
essay booklets written by graduate-level candidates with GRE scores was
165: 59 Spanish, 88 Chinese, 18 Arabic, and 55 English. The groupings
of candidates by major fields are reported in Table III-1. The total
sample is relatively representative of language groups, major fields,
and graduate-level candidates with GRE scores.

Table III-1
Sample Description

                                Graduates
Language Group  Undergraduates  Business  Hard Science/  Social   Unknown  Total
                                          Engineering    Science

Arabic          45              7         64             23       1        140
Chinese         89              30        65             47       1        232
Spanish         69              39        63             33       7        211
English                                                           55*      55

Total           203             76        192            103      64       638

* Because of the small number of native English speakers, no attempt was made
to classify them separately by intended major.

IV. SCORING THE WRITING SAMPLES AS DIRECT MEASURES
OF WRITING ABILITY



To compare the results of scoring the writing samples by using
different scoring methods, the papers were scored as follows:

o Holistic scoring of all booklets, primarily on the first day of the
essay reading weekend

o Two-score scoring of all booklets, for discourse/sentence
characteristics, primarily on the second day of the essay reading weekend

o Holistic scoring of a representative subsample of the papers by
subject matter experts in two major fields of graduate education

o Descriptive scoring of the features of a representative subsample
of the papers using the Writer's Workbench software

Other tasks needed to be accomplished before any scoring was begun,
however: preparing the test booklets, planning the weekend essay reading
session, selecting samples for training during the reading weekend, and
refining sample selection during the training of table leaders for the
reading weekend. These procedures are described in the next section of
this chapter, and the application of the four scoring methods is described
in subsequent sections.



Preparation for the Essay Reading Weekend 
Planning for the Reading Weekend 



At the point when plans needed to be made final, not all test booklets
had arrived, nor was all testing completed. Since additional international
data collections were scheduled for January, the total number of
test booklets could only be estimated. It was necessary to estimate the
greatest number of booklets that might be obtained in order to invite a
sufficient number of readers and to organize the readings by tables and
space. Thus the staff estimated that it would be possible to obtain a
total of 4,000 papers, or 1,000 papers per topic. To estimate the amount
of time it would take to read these papers, this figure was multiplied
by four because each paper would be read twice for the holistic scoring
and twice for the discourse/sentence scoring. This resulted in an estimate
of 16,000 papers to be read. Using the scoring rate that appeared to be
reasonable during the reading of the pilot test writing samples, that of 35
papers per reader per hour, and the actual amount of reading time that
could be planned for one weekend (minus training), it was determined that
48 readers would be required.

To balance the academic perspectives of the readers, the staff decided
to invite 24 ESL and 24 English readers who had experience with evaluating
compositions. The number of table leaders (eight) was determined by
dividing the 48 readers into tables of six; this number of readers was
recommended on the basis of considerable experience with essay readings at
ETS. Further, it was determined that two chief readers, eight aides, and
four members of the project staff would be needed when training, space
arrangements, paper flow, and the like were taken into consideration.
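
The planning arithmetic above can be reproduced as follows. The paper
counts, reading rate, and numbers of readers come from the text; the
implied reading hours per reader is a derived figure, since the number of
reading hours available is not stated here, and the variable names are
illustrative.

```python
papers_per_topic = 1_000
topics = 4
readings_per_paper = 4        # two holistic plus two discourse/sentence readings
papers_per_reader_hour = 35   # rate observed during the pilot reading
readers = 48
readers_per_table = 6

total_papers = papers_per_topic * topics                 # estimated papers
total_readings = total_papers * readings_per_paper       # individual readings
reader_hours = total_readings / papers_per_reader_hour   # total reading hours
hours_per_reader = reader_hours / readers                # implied load per reader
table_leaders = readers // readers_per_table             # one leader per table

print(f"readings: {total_readings}")
print(f"hours per reader: {hours_per_reader:.1f}")
print(f"table leaders: {table_leaders}")
```

With 48 readers, the estimate implies roughly nine and a half hours of
reading per reader across the weekend, exclusive of training.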

Sample Picking Sessions 

The objective of the sample picking sessions was to select papers that
represented the range of the six-point score scale, both for the holistic
scoring and for the two-score scoring of the papers for each of the four
topics. After a sufficient number of papers had been selected for one
topic, they were arranged in an order that would be used for discussion at
the table leaders' meeting, when specific papers would be selected as the
benchmarks for training readers. The order in which they were arranged
did not correspond to the sequence of holistic scores, but rather to a
random sequence that would not suggest some predetermined score.

On the second day of sample picking, the selection of sample papers
for the holistic scoring of the four topics was completed. The selection
of papers for the two-score ratings for discourse-level and sentence-level
(and below) skills proceeded much more slowly. As the staff and chief
readers attempted to read the papers to arrive at a general impression of
two scores, agreement was difficult to reach, and it was nearly impossible
not to reread the paper before assigning one or both scores. Much more
discussion was required before criteria for each of the two scores could
be clarified. Clearly, many of the features that influence the evaluation
of discourse-level skills also influence the evaluation of sentence-level
skills; thus it was difficult to attempt to rate the two levels
independently. Eventually, the readers were able to arrive at reasonable
agreement on the sample papers for two of the topics, one chart/graph
(Farming) and one compare/contrast (Space) topic. Everyone expressed
concerns that the readers would experience the same difficulties, and that
the criteria would be less salient, resulting in less justifiable and less
reliable criteria.

Since sample papers for two topics for the two-score method had not
been selected during the two days of sample picking, the project staff
spent an additional day selecting the sample papers for these topics
(Recreation and Continents). The staff appeared to experience somewhat
less difficulty with determining these range finders. From the standpoint
of an operational testing situation, the effort required to identify range
finders for the two-score method is not efficient, unless the data
demonstrate that the two scores can provide independent information about
writing ability.

Given the time involved in acquiring the criteria for making the
two-score distinctions, the staff decided to formulate an alternate plan
for the second day of the essay reading weekend. Although it might have
been possible for the readers to assign two scores to the papers written in
response to the four topics, readers should not be placed under unrealistic
time pressures. Rather than sacrifice reader accuracy and reliability, the
staff decided that the readers would be expected to assign the two scores
to papers written on only two topics, one of each type, on the second day
of the weekend reading. The range finders were prepared, however, for the
contingency that these readings would proceed more quickly than
anticipated.

All range finders, arranged in random order, were labeled sequentially
with letters of the alphabet and numbers designating the specific topic
(1-4) for easy reference during training and discussion. They were
assembled by the Essay Reading Office and printed on colored paper to
correspond with the colors of the writing stimuli as presented in the test
booklets.

Chief Readers' Meeting 

A table leaders' meeting was scheduled for the Friday evening prior to
the essay reading weekend. On the afternoon preceding this meeting, the
project staff met with the chief readers to prepare for the evening
meeting. The meeting with the chief readers covered the following topics:
an overview of the objectives of the research study and of the weekend
reading plans, details of the mechanics of the reading sessions, and the
agenda for the table leaders' meeting. Several decisions were made with
regard to the orientation session for the table leaders and the conduct of
the weekend readings:

o The responsibilities of the individuals involved in directing the
readings

o The specific functions of the chief readers

o The specific functions of the table leaders

Table Leaders' Meeting 

The project staff, chief readers, and table leaders met during the late
afternoon and evening of the day preceding the reading weekend. Eight
table leaders had been selected by the project staff: four individuals who
have considerable experience with English composition and had served as
readers or table leaders for the New Jersey Basic Skills essay readings,
and four individuals who have experience with ESL composition and with
essay readings in other contexts. The four ESL table leaders also had
served as readers during the reading of the pilot test writing samples.

This meeting covered the following topics: an overview of the research
and reading plans, the mechanics of the reading sessions, the agenda for
the evening, the functions of the table leaders, and the preliminary
selection of sample papers to be presented to the readers for the final
selection of range finders during the reader training periods. Samples
were read and selected separately for each of the four topics (e.g., all
sample papers for the holistic scoring of the Space topic were read and
selected first). The goal of the table leaders was to select, for each
topic, eight papers that would be presented to the readers to represent the
points on the score scale. These range finders would represent the entire
range of score points, and should be most "typical" of the scores at each
point. The range finders would be selected on the basis of having the
best reader agreement, legibility, and not being too unusual. A few papers
that represented atypical responses to the topic also were selected to
serve as examples of papers that might present problems of which the
readers should be aware.

Although we had intended also to select sample papers for the two-score 
scoring, we had underestimated considerably the time this meeting would 
take. Thus, very late in the evening, we decided to select sample papers 
for only two of the topics, one of each type (Space and Farming) and to 
complete the selection during a break in the weekend readings. The same 
process used in selecting the sample papers for holistic scoring was used; 
however, the table leaders needed to arrive at consensus on two scores for 
each paper. During discussion, they expressed the same concerns that the 
staff and chief readers had experienced during the preliminary sample 
picking, but did not appear to have as much difficulty in arriving at 
consensus. Although some substitutions of papers were made, the table 
leaders selected essentially the same range finders that we had selected
previously. 

Discussions throughout the table leaders' meeting led to the agreement
that the readers should be alerted to a significant concern that we had
experienced, a problem of topicality. An important scoring criterion is
how well the writer addresses the topic within the constraints of the
testing situation. Evaluation of the papers should take into account the
total context (the subjects, both native and nonnative speakers of English;
the testing administration; and the research objectives), as discussed in
Chapter I. We agreed that some papers might seem "off-center," in
comparison to other papers, if the writers in some way misinterpreted the
task. If so, the readers would be instructed to focus on the quality of
the writing, or, if unable to do so, to refer the paper to the table leader
for scoring by the chief readers. For the chart/graph topics, in
particular, the content of some papers might not be supported by the data
in the chart or graph, or a writer might make generalizations going beyond
the data instead of dealing directly with the data. In these cases, the
readers would be instructed to place emphasis on the quality of the
development of the ideas rather than on their evaluation of the correctness
of those ideas.



The Essay Reading Weekend 

During the weekend of January 28 and 29, the writing samples were 
scored, with Saturday devoted to holistic scoring, and Sunday, to the 
discourse/sentence-level scoring.

Holistic Scoring

The chief reader for the holistic scoring training described the 
conduct of the readings. Readers were then given their room assignments,
and each of the two chief readers conducted the training for the holistic
scoring of the specific topics to be read in the two reading rooms. The
readings of the topics were balanced such that the Continents (pink)
chart/graph papers were read during the morning in one room, while the
Space (blue) compare/contrast papers were read during the morning in the
other room. During the afternoon, the Farming (yellow) chart/graph papers
were read in one room, and the Recreation (green) compare/contrast papers,
in the other room. Thus the readers did not read the papers on all topics,
but each reader scored papers on topics of the two types. Within each
reading room, ESL and English readers were balanced at each table. Numbers
were assigned to the readers to facilitate distribution of the test
booklets, since each paper was read twice, once by an ESL reader, and once
by an English reader. As determined during the table leaders' meeting, the
conventions for conducting the check readings, resolving discrepancies,
and other important arrangements to ensure quality control were carried
out.

At the conclusion of the holistic reading sessions, all participants 
filled out the Saturday reader questionnaires. Since the reading concluded 
later than anticipated, the readers were asked to plan to discuss their 
reactions to the scoring prior to the readings on the next day. Saturday 
evening was reserved for relaxation, a very important consideration to
prevent fatigue. 

Discourse/Sentence Scoring

On Sunday morning, the chief readers and table leaders met to make the 
final selection of range finders that had not been selected for the 
two-score scoring method during the table leaders' meeting. While this
meeting took place, the project staff conducted a discussion with the 
readers to elicit their comments about the holistic scoring procedures. 
Most of these comments also were reflected in the questionnaires. One 
significant reaction, which is relevant to scoring reliability, was that 
the readers felt that the sample training papers for all topics were good 
examples of the score scale; during training, they reached consensus 
readily. 

The discourse/sentence readings began midmorning on Sunday, with a
brief introduction to the scoring procedures by the chief reader, who was 
responsible for this training. The chief readers then conducted the 
training on each topic to be scored in their respective reading rooms. 
During this training, the chief readers recommended that the readers assign 
the score for discourse-level skills first and determine the score for 
sentence-level skills as a second decision. Next the readers in one room 
scored the papers on the Farming (yellow) topic while readers in the other 
room scored the papers on the Space (blue) topic; the papers on the other
two topics were not assigned discourse/sentence scores. The readers
actually assigned the two scores as rapidly as they had assigned the single
holistic scores, and did not appear to experience the difficulties we had
experienced when selecting and scoring the sample training papers. At the
conclusion of these readings, the readers filled out the Sunday reader
questionnaires.



Cleanup Readings 

To complete the scoring of all papers, it was necessary to schedule two 
additional full days for reading papers after the weekend reading session. 
One of these days was devoted to the holistic reading method, and the 
second day to the discourse/sentence reading method.

For the holistic scoring, the reading staff consisted of readers who 
had participated in the weekend readings (two English and three ESL),
including the chief reader who conducted the training for holistic scoring,
and ETS staff members. For the discourse/sentence scoring, with fewer
papers, only one English (serving also as chief reader) and one ESL reader,
plus staff, were needed. The same procedures as those used for the weekend
readings were carried out to ensure standardized procedures and quality
control.

The cleanup readings were required to score new test booklets (60) that
had arrived after the weekend reading, to score some papers for which two
holistic scores had not been assigned (72), and to resolve the scores for
papers with discrepant (more than two points difference) scores. The total
number of discrepancies for the holistic scores was 49 and for the
discourse/sentence scores, 59. Our time estimate for reading the papers
was appropriate: the readers scored approximately 35 to 40 papers per hour;
the scoring sessions, of course, included time for training on the sample
papers.



Subject Matter Readings 

Although we had planned to ask faculty members to read samples of
papers written in response to all four topics, we decided instead to ask
them to read a sample of papers on one chart/graph topic (Farming) and one
compare/contrast topic (Space). When we initially contacted some faculty
members, they indicated that asking them to read 200 papers was reasonable,
but that reading 400 papers would be too time consuming and difficult. We
also chose to obtain scores from more (four, rather than two) faculty
members from each of two disciplines, and for all the samples in two sets,
which would permit more valid comparisons among the several readers. Thus
four faculty members in each discipline, the social sciences and the hard
sciences/engineering, assigned ratings to one set of papers for each of the
two topics.

The papers were selected to obtain a sample of papers on each of the
two topics that were representative of the full range of holistic scores
for each of the four language groups and for each major field represented
in each language group. We were not able to represent all scores,
languages, and major fields for either topic in cases for which the full
range of scores had not been assigned; however, the distribution of
papers is representative of the total sample of papers. A total of 92
writing samples were selected for the Space topic and a total of 95 for the
Farming topic.

After agreeing to assign scores to the writing samples, each faculty 
member received a letter of instruction, copies of the two sets of writing 
samples, and forms on which to enter the scores. The holistic ratings, on
a scale of one through six, were expected to reflect the individual's
views, as a subject matter expert in his field, in the hypothetical
situation in which such ratings might be used during the process of making
admission decisions about candidates. The criteria to be applied to the
rating decisions were to reflect "writing competence" for academic work in
the discipline of the faculty member.



Writer's Workbench Descriptive Scoring 

The Writer's Workbench is a computer system consisting of several 
programs that offer diverse text analysis features, including proofreading,
stylistic analysis, and the rules of English usage. It was developed by
the Documentation Technologies Group at Bell Laboratories to assist with
text editing at AT&T. The system analyzes prose passages that have been
keyed in on a computer terminal. It is intended to serve as a tool that
has its limitations, in that its capabilities do not encompass all the
complexities of writing; however, the different programs are based on what
most experts would agree are the tenets of good writing, such as avoiding
wordy diction and eliminating passive voice.

At Colorado State University (CSU), faculty in the English and computer 
science departments obtained permission to use and adapt the Writer's 
Workbench programs for teaching composition, as a "research exchange" 
(Kiefer & Smith, 1984; Smith & Kiefer, 1982). At that time, the Workbench
was not on the market; it now is available through a lease arrangement.
The CSU faculty modified the programs for the needs of beginning college
writers and joined the 17 separate programs to run with one command. The
CSU system was used in this study to obtain numerical data on a variety of
separate (analytical) features exhibited by four sets of samples of papers
selected from the total sample of papers collected. One representative set
of papers, selected on the basis of holistic scores, language groups, and
major fields of study, was chosen for each of the four topics. For the
Space and Farming topics, the same set of samples was analyzed on the
Writer's Workbench that was rated by the subject matter experts: 92 Space
papers and 94 Farming papers. For the Continents chart/graph topic, 92
papers were chosen, and for the Recreation compare/contrast topic, 90
papers.

Joy Reid, a faculty member in the ESL department at CSU, made
arrangements for Roberta Scott, a composition instructor who also operates a
writing service, to key in the papers on the terminals. She entered each
paper verbatim and subsequently obtained Workbench analyses for the four
sets of papers. The output yields an astounding amount of numerical
data; with Joy's informed advice, we selected the data that would be the
most dependable and meaningful in the data analyses. In instances in which
the data were overlapping, such as the number of pronouns and the
percentage of pronouns, we selected the percentage figures. The Style
program produces a large quantity of numerical scores for the various
quantifiable features of prose, whereas the Prose program supplies
interpretive comments regarding many of these same features. The Prose
comments compare the paper's style values against a set of standards and
describe the differences to the reader. Since some overlap occurs between
these programs, we eliminated that redundant data as well. Thus the
Writer's Workbench system provided a considerable number of objectively 
derived "scores" for the various quantifiable features exhibited by each of 
the papers in the four representative subsamples. 

Descriptions of the features analyzed by the Writer's Workbench are 
presented in the paper by Smith and Kiefer (1982). The specific features
that became relevant in the data analyses are defined more fully in the 
section of Chapter V that reports the Writer's Workbench data. 



Scoring of Other Instruments 

LSAT Indirect Measure 

With permission from the Law School Admission Test (LSAT) program, a 
retired form of the LSAT measure of indirect writing ability was 
administered to the 55 students who are citizens of the United States with 
English as their primary language. The test consists of a total of 60
items, 35 of the usage type and 25 of the sentence correction type. The
tests were hand scored, resulting in number-right scores for the total test
and for each of the two sections. 

Reader Questionnaires 

Most of the readers, table leaders, and chief readers completed the 
reader questionnaires at the conclusion of each of the readings on the 
Saturday and Sunday of the weekend reading session. Since the table 
leaders and chief readers also were involved in reading papers, their 
responses were combined with the reader responses. On Saturday, a total of 
50 participants, 24 ESL and 26 English, completed the questionnaires. On 
Sunday, a total of 51 participants, 24 ESL and 27 English, completed them. 

The responses to open-ended questions were recorded verbatim and are 
available on request. These comments were sufficiently informative, 
interesting, and varied that we did not attempt to categorize them. The
responses that required choices among a list of alternatives were entered
on a data tape for further analyses. 

The responses to the reader questionnaires obtained at the conclusion 
of the Saturday readings consist of reader reactions to criteria used to 
evaluate written assignments, both in their everyday experience in 
evaluating writing samples (e.g., in instruction) and as they evaluated the 
papers during the holistic scoring. Additional questions asked for their 
reactions to the holistic scoring of this sample of papers. The Sunday
questionnaire responses again asked the readers to respond to the same set 
of criteria to report how they evaluated the papers during the discourse/ 
sentence scoring. In addition, they were asked to react to the discourse/ 
sentence method of scoring and to compare the discourse/sentence scoring 
method with the holistic scoring method. These responses are subjective,
of course; therefore, they do not provide accurate documentation of the
actual processes used by the readers as they evaluate writing samples. The
data indicate, however, the readers' perceptions of the various approaches 
to the assessment of writing ability. These results are reported in 
Chapter V.



V. RESULTS



The various test scores that were obtained were viewed in several
ways: descriptive score distributions, estimates of reliability,
exploratory and confirmatory factor analyses, and correlational and
regression analyses. The first two sections of this chapter describe the
test score data for the different candidate populations in the sample and
subsamples. The next section reports the estimates of reliability for the
scores assigned to the writing samples. The following section reports the
results of the exploratory and confirmatory factor analyses and the
relationships of the test scores to the factors and other test scores.
Finally, the data obtained from the Writer's Workbench analytical scoring
are described. The sizes of the samples for the total sample and different
subsamples are not equivalent to the sizes of the samples obtained in the
data collection. They are somewhat reduced because the data that were
subjected to factor analyses consist of candidate scores and demographic
variables that were complete, so that no calculations were made for
individuals with missing data (e.g., age not reported).



Descriptions of Scores on the Conventional Tests 
TOEFL Scores 

The total sample of foreign students with complete data and with TOEFL
scores was 542, consisting of 138 Arabic language, 230 Chinese language,
and 174 Spanish language candidates (Table 1). The mean TOEFL score for
the total sample was 519.97 with a standard deviation of 64.08. The means
of the section scores for the total sample on the TOEFL were relatively
equivalent, rounded to a mean of 52 for each section.

The means of the TOEFL scores for the Spanish language group are only
slightly higher than the means for the Chinese language group, whereas the
means for the Arabic language group are the lowest. These means, when
compared with the normative data for the same language groups (Table 1) as
reported in the TOEFL Test and Score Manual (1983), suggest that the
candidates in these samples are above average. This result was
anticipated, because the sample consisted of volunteer students who claimed
they anticipated coming to the United States to study within the next
year and who apparently felt competent to write in English.

GRE General Test Scores

The sample of international and United States candidates who completed
the four writing samples, and for whom we could obtain scores on the GRE
General Test, was 172, consisting of 124 international students and 48 
United States students (Table 3). The sample of international candidates 
who took the TOEFL was 124. The number of students in the United States 
sample for whom GRE General Test scores could be obtained, and who also 
completed the LSAT writing test, was 43.

The mean GRE verbal score (385) for the total sample (Table 2) is
considerably lower than the mean GRE verbal score (471) for all examinees
who took the GRE General Test between 1981 and 1983 (as reported in GRE
Guide to the Use of the Graduate Record Examinations Program, 1983-84),
with a larger standard deviation (145) than reported for all candidates
(130). The mean GRE quantitative score (635), however, is substantially
higher than the mean GRE quantitative score (537) for all candidates, with
a lower standard deviation (114) than for all 1981-1983 candidates (137).
These GRE quantitative scores probably reflect the level and kind of
preparation of the candidates in the sample for this study, since a large
number of the students indicated plans to major in the hard sciences and
engineering in graduate school. In fact, the GRE quantitative average test
scores of 1981-1983 examinees intending to major in the biological sciences
(bioscience subtotal mean of 580) and the physical sciences (subtotal mean
of 628) are higher in general than the average test scores of intended
humanities (means ranging from 458 to 521) and social sciences (means
ranging from 434 to 603) majors. The mean GRE analytical score (488) for
this sample is slightly lower than for all examinees (501), with a slightly
lower standard deviation (120) than for all 1981-1983 examinees (127). The
mean for GRE analytical more closely approximates the average GRE
analytical test scores (491-528) of examinees who intended to major in the
social sciences. Thus we observe a somewhat different pattern of scores
for the GRE General Test for this sample than would be expected if the
sample had consisted of predominantly native speakers of English. Where
the TOEFL score means tended to be slightly higher than for the average
TOEFL candidate population, the GRE scores do not reflect this pattern
because the GRE norming sample is composed mostly of native speakers of
English, whereas the TOEFL is normed on an ESL group.

Table 3 compares scores on the sections of the GRE General Test
obtained by the foreign and United States candidates. The mean (551.46) of
the GRE verbal scores for the students for whom English is their primary
language (United States) is substantially higher than the mean (320.30) for
the foreign candidates for whom English is not their primary language. The
standard deviation (91.61) of the scores for the foreign group is
considerably smaller than the standard deviation (121.29) for the United
States group as well. This result is not surprising, of course, since
English language proficiency is evaluated in the GRE verbal sections. The
mean (567.71) of the GRE quantitative scores for the United States
candidates is considerably lower than the mean (660.81) for the foreign
candidates, probably reflecting again the large number of foreign students
who plan to major in, and have prepared for, the hard science and
engineering fields. Finally, the mean (583.33) of the GRE analytical
scores for the United States candidates is considerably higher than the
mean (450.48) for the foreign candidates; however, the difference between
the two groups is not as striking as the difference between means on the
GRE verbal. Since the GRE analytical is considered to be confounded with 
verbal ability (in English), this result suggests that the GRE analytical 
also assesses some form of analytical reasoning ability that is not as 
entirely dependent on English language proficiency as evidenced in the 
GRE verbal sections.

LSAT Writing Test Scores

The means of the scores obtained by the United States candidates on the
60-item indirect measure of writing ability also are shown in Table 3. For
both sections of the test, the candidates appeared to perform equally well
on the usage items and the sentence correction items; the mean score for
each section reflects approximately 60 percent correct answers. The mean
score on the usage section for this sample of students was 21.05, with a
standard deviation of 6.62. This mean is only slightly higher than the
mean score of 20.97 obtained on this form by a population of 2300 LSAT
candidates in 1979, with a standard deviation of 6.20 (internal document).
The mean score on the sentence correction section for the LSAT candidates
was 14.46, with a standard deviation of 4.14; performance on this section
by our U.S. sample, who obtained a mean score of 14.72, with a standard
deviation of 4.32, was approximately equivalent. Thus this sample of U.S.
candidates performed on an indirect measure of writing ability at levels
represented by a somewhat selective group of graduate-level candidates for
law school.



Writing Sample Scores 

The means and standard deviations of the writing sample scores are
presented in Tables 3 and 4.

Means and Standard Deviations — Foreign Sample 

To facilitate cross-task comparisons, only subjects with complete data on
both the writing samples and the TOEFL were included in these analyses. The
writing samples were assigned ratings on a one through six scale. The means
of the scores reported in Table 4 were averaged over two readers. For every
writing sample score, the means are lowest for the Arabic sample, in the
middle for the Chinese sample, and highest for the Spanish sample. When the
holistic score means are compared with the discourse/sentence score means, the
two scoring methods essentially yield the same mean levels of performance. In
addition, except for level difference between language groups, the mean
writing sample scores for the different topics are approximately equivalent.
This result suggests that (1) the different topics did not elicit
qualitatively different writing performance, and/or (2) the readers maintained
a comparable scale for evaluating the writing samples, despite possible
performance fluctuations from topic to topic.

Means and Standard Deviations — English-Speaking U.S. Sample 

The data summarized in Table 3 also include only subjects with complete
data. This table compares the score differences between the international and
United States candidates. The mean (20.53) of the holistic writing sample
scores for the English-speaking group is considerably higher than the mean
(12.56) for the foreign group, but the scores have approximately the same
standard deviation. Thus the average score on papers written on one topic for
the United States candidates is five; for the international candidates, three
(on the scale of one through six).

The means of the discourse/sentence scores for the United States group also
are higher than the means for the foreign group and with approximately
equivalent standard deviations. Thus the average score on papers written on
one topic for the United States candidates is five; for the foreign
candidates, three. The two scoring methods clearly did not yield different
evaluations of the average level of quality of the papers for these two
groups.



Estimates of Score Reliability for Writing Samples 

Reliability of Holistic Scores

Reliability coefficients reflect the extent to which a test provides
consistent results. The reliability coefficient is a generic term. Different
reliability coefficients can be based on various types of evidence, with each
type of evidence having a different meaning. For the current study, evidence
for the reliability of the scores on the direct measures of writing
performance involved the several sources of error that may reduce the
reliability of scores assigned to these measures: consistency of writing
sample scores across readers (raters), across topics within topic types, and
across topic types.

Interrater reliability

Each paper was read initially by two readers. If the ratings assigned by
the two readers were more than two points apart, the paper was read by a third
reader, and the most discrepant rating was dropped. As indicated in Table
V-1, there were relatively few cases where the readers were more than two
points apart in their holistic judgments. Out of the 2552 pairs of judgments
(638 students x 4 essays each), discrepancies greater than two points were
found in only 56 (2.2%) of the cases.

Table V-1

Papers with Discrepant
Holistic Scores

                                  Percent of papers with disagreement
                                  of more than 2 points between
Topics                            reader 1 and reader 2

Compare/Contrast - Space                       2.6%

Compare/Contrast - Recreation                  2.4%

Chart/Graph - Continents                       3.3%

Chart/Graph - Farming                          2.6%

Correlations between the ratings
assigned by the original two readers (i.e., before eliminating any discrepant
scores) are presented in Table V-2.

Table V-2

Interrater Correlations of
Holistic Scores

Topics                              r     Spearman-Brown corrected r

Compare/Contrast - Space           .74              .85

Compare/Contrast - Recreation      .71              .83

Chart/Graph - Continents           .66              .80

Chart/Graph - Farming              .73              .84

The interrater reliabilities were consistently high for all topics and appear
to represent about the best that can be expected with complex judgments of
this type (Breland & Jones, 1982). The uncorrected correlation coefficient is
an estimate of the reliability if only the scores from one judge are to be
used operationally; if two judges are to be used, the Spearman-Brown
correction provides an estimate of the reliability of the scores based on
summing the judgments of two raters. Although the precise numerical impact of
using a third reader to adjudicate score discrepancies of more than two points
cannot be directly estimated, the values in Table V-2 may be taken as a lower
bound for the reliability of the adjudicated scores. In all subsequent
analyses in this section, the adjudicated scores were used. However, note
that because of the small number of scores that were changed, it would make
very little difference whether adjudicated or unadjudicated scores were used.
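
The scoring rules in this subsection can be sketched in a few lines of Python. This is an illustration only, not the procedure as actually implemented for the study: the function names are hypothetical, the tie-breaking rule in adjudicate is an assumption (the report does not say what was done when a third rating fell midway between the first two), and the Spearman-Brown step uses the standard formula for the sum of k parallel ratings.

```python
def needs_third_reading(r1: int, r2: int, max_gap: int = 2) -> bool:
    """True when two ratings on the 1-6 scale differ by more than max_gap points."""
    return abs(r1 - r2) > max_gap


def adjudicate(r1: int, r2: int, r3: int) -> tuple:
    """Drop the most discrepant of three ratings and return the two that remain.

    Assumption: "most discrepant" means farthest from the other two ratings;
    on a tie, the earlier rating is dropped (an arbitrary choice).
    """
    ratings = [r1, r2, r3]
    # Total distance of each rating from the other two.
    distance = [sum(abs(r - other) for other in ratings) for r in ratings]
    drop = distance.index(max(distance))
    return tuple(ratings[:drop] + ratings[drop + 1:])


def spearman_brown(r: float, k: float = 2.0) -> float:
    """Reliability of the sum of k parallel ratings, given single-rating reliability r."""
    return k * r / (1.0 + (k - 1.0) * r)


# Uncorrected interrater correlations from Table V-2.
single_reader_r = {"Space": 0.74, "Recreation": 0.71, "Continents": 0.66, "Farming": 0.73}
two_reader_r = {topic: round(spearman_brown(r), 2) for topic, r in single_reader_r.items()}

# Share of the 2552 paired judgments that differed by more than two points.
discrepancy_rate = round(100 * 56 / 2552, 1)
```

Applied to the uncorrected correlations, spearman_brown with k = 2 gives the values in the corrected column of Table V-2, and the discrepancy rate works out to the 2.2 percent cited above.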

Reliability across topics 

In addition to disagreements between raters, another source of incon- 
sistency in writing sample scores may be differential student performance on 
different topics. Some students may find some topics easier than others, or 
certain topics may demand different kinds of discourse skills that also elicit
differential performance. The intercorrelations among the holistic scores on
the four topics are presented in Table V-3. 

Table V-3

Across-Topic Correlations
Among Holistic Scores

                                    1     2     3     4

1 Compare/Contrast - Space         --    .71   .69   .73

2 Compare/Contrast - Recreation    .85   --    .66   .71

3 Chart/Graph - Continents         .84   .81   --    .68

4 Chart/Graph - Farming            .86   .85   .83   --

Note: Correlations under the diagonal are corrected for interrater
unreliability.

Correlations under the diagonal are estimates of what the correlation among
topics would be if readers were perfectly reliable. Note that both the
corrected and uncorrected coefficients indicate that correlations are no
higher within topic types than across topic types. Thus, for example, compare/
contrast topic 1 is not more highly correlated with the other compare/contrast
topic than it is with the chart/graph topics. This suggests that, at least
for these topics, there are not systematic differences in the way each topic
type ranks students. The correlation of .83 between total score for one topic
type (formed by adding the two scores for the topic type) with the total score
for the other topic type is consistent with this suggestion. When corrected
for unreliability, the correlation between the totals for the two topic types
is approximately 1.0.
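
The correction for interrater unreliability can be illustrated with the standard correction-for-attenuation formula, in which an observed correlation is divided by the square root of the product of the two reliabilities. The report does not state which reliability estimates were used; the sketch below assumes the Spearman-Brown corrected (two-reader) values from Table V-2, and the function and variable names are hypothetical.

```python
import math


def disattenuate(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Estimate the correlation between two measures if both were perfectly reliable."""
    return r_xy / math.sqrt(r_xx * r_yy)


# Assumed reliabilities: the two-reader (Spearman-Brown corrected) values in Table V-2.
reliability = {"Space": 0.85, "Recreation": 0.83, "Continents": 0.80, "Farming": 0.84}

# Uncorrected across-topic correlations (above the diagonal in Table V-3).
observed = {
    ("Space", "Recreation"): 0.71,
    ("Space", "Continents"): 0.69,
    ("Space", "Farming"): 0.73,
    ("Recreation", "Continents"): 0.66,
    ("Recreation", "Farming"): 0.71,
    ("Continents", "Farming"): 0.68,
}

corrected = {
    pair: round(disattenuate(r, reliability[pair[0]], reliability[pair[1]]), 2)
    for pair, r in observed.items()
}
```

Under that assumption the computed values (.85, .84, .86, .81, .85, and .83) agree with the entries below the diagonal in Table V-3.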

Reliability within language groups 

The above reliability analyses were repeated in each of the four language 
groups. Because of greater score homogeneity within groups, the correlations 
were slightly lower, but the patterns were remarkably stable. All the 
generalizations made about score relationships in the total group apply to the 
subgroup analyses. For example, the correlation between the compare/contrast 
total and the chart/graph total (.83 in the total sample) was .72, .75, .84,
and .69 in the Spanish, Chinese, Arabic, and English samples, respectively.
The above correlations estimate the reliability of half of the test. They may
be corrected by the Spearman-Brown formula to estimate the reliability of the
entire writing sample. The estimated reliability for all four writing samples
is .91 for the combined language groups; the estimated reliabilities are .84,
.76, .91, and .82 in the Spanish, Chinese, Arabic, and English samples,
respectively. The reliability in the English sample is remarkably high, given
the ceiling level performance of many students in this group.



High estimates of reader reliabilities for holistic and discourse/ 
sentence scores assigned by different readers to the same and different 
topics indicate that readers are able to reach considerable agreement on 
the relative quality of a set of papers they are judging. However, this 
evidence does not indicate whether different readers are evaluating the 
same features of writing or whether they are attending to different 
features when making decisions to assign a specific score to writing 
samples that address different topics (content) and require different 
approaches to the task (e.g., compare/contrast vs. chart/graph). During 
the pretest readings, the pilot test samples elicited by the different topics
showed apparent differences. Although we have no means by which to
establish that the readers adjusted their standards with respect to the
specific features, depending on the specific topic and its task demands,
the possibility cannot be rejected. The responses of readers to the
questionnaires, reported in the next section, and the Writer's Workbench
analyses, summarized at the end of this chapter, offer some insights about
the features of writing samples to which the readers may have been
attending.
Reliability of Discourse- and Sentence-Level (D/S) Scores

One compare/contrast topic (Space) and one chart/graph topic (Farming)
were scored separately for discourse-level characteristics and sentence-level
characteristics.

Interrater reliability 

Table V-4 presents the percent of papers on which the two readers
disagreed by more than 2 points on either the discourse-level or sentence-
level scores.

Table V-4
Papers with Discrepant D/S Scores

                              Percent of Papers with Disagreement
Scores                        of More Than 2 Points

Space - Discourse level       2.4%
Farming - Discourse level     2.8%
Space - Sentence level        4.1%
Farming - Sentence level      1.9%
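
A discrepancy rate of the kind reported in Table V-4 could be computed along the following lines. This is a minimal sketch: the function name `discrepancy_rate` and the ten pairs of scores are hypothetical and are not data from the study.

```python
def discrepancy_rate(first, second, threshold=2):
    """Percent of papers whose two reader scores differ by more than
    `threshold` points."""
    if len(first) != len(second) or not first:
        raise ValueError("score lists must be non-empty and of equal length")
    discrepant = sum(1 for a, b in zip(first, second) if abs(a - b) > threshold)
    return 100.0 * discrepant / len(first)


# Hypothetical scores from two readers for ten papers (not study data).
reader_1 = [3, 4, 2, 5, 6, 1, 3, 4, 5, 2]
reader_2 = [3, 5, 2, 2, 5, 1, 4, 4, 5, 3]

rate = discrepancy_rate(reader_1, reader_2)
print(f"{rate:.1f}% of papers differ by more than 2 points")
```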



The corresponding correlations between readers are presented in Table V-5.

Table V-5

Interrater Correlations of D/S Scores

                                       Spearman-Brown
Scores                        r        Corrected r

Space - Discourse level       .66      .80
Farming - Discourse level     .72      .84
Space - Sentence level        .71      .83
Farming - Sentence level      .72      .84



The reliabilities for the sentence-level and discourse-level scores are
essentially identical and are also comparable to the interrater reliabilities
of the holistic scores.
Reliability across score types and across topics

Table V-6 permits comparison of the relationship between discourse-level
and sentence-level scores, both within topics and across topics.

Table V-6

Correlations Between Discourse and Sentence Level
Scores Within and Across Topics

                                  1       2       3       4

1 Space - Discourse level                .66     .87     .72
2 Farming - Discourse level      .81             .63     .86
3 Space - Sentence level        1.06     .75             .73
4 Farming - Sentence level       .88    1.02     .87

Note: Correlations under the diagonal are corrected for interrater
unreliability.

The highest correlations were across score types within topic. Thus, for
example, the discourse-level score from the Space topic correlates more highly
with the sentence-level score from the same topic than it does with the
discourse-level score on the Farming topic. This pattern may be partially
explained by the scoring strategy in which the same reader assigned both a
discourse-level and a sentence-level score at the same time (which also may
explain the correlations greater than 1 in the corrected correlations under
the diagonal). However, it legitimately suggests that an operational program
would gain nothing from a two-score system, at least if both scores are
assigned by the same rater.

Summary scores were formed by adding the two discourse-level scores to
form a discourse total and the two sentence-level scores to form a sentence
total. The correlation of the discourse total with the sentence total was
.90, further reinforcing the view that the two scores can be treated
essentially interchangeably. Furthermore, the discourse total correlated .87
with a holistic total formed by adding the holistic scores on the same two
essays (Space and Farming), and the sentence total correlated .88 with the
holistic score. Thus, the discourse-level score, the sentence-level score,
and the holistic score all appear to be assessing the same underlying writing
skill.

Reliability within language groups

As with the holistic scores, the pattern of correlations within subgroups
for the discourse/sentence scores paralleled the findings in the group as a
whole. Correlations of the two discourse-level scores, the two sentence-level
scores, and the discourse-level total with the sentence-level total for each
language group are presented in Table V-7.

Table V-7

Correlation of D/S Scores
Across Language Groups

                  D-Level      S-Level      D-Total
                  Space vs     Space vs     vs
Language Group    Farming      Farming      S-Total      N

Arabic            .60          .63          .88          138
Chinese           .54          .61          .87          230
Spanish           .54          .60          .85          174
English           .23          .18          .66           42
Total             .66          .73          .90          585



Except for the low reliability in the native English sample (where many scores 
were at ceiling levels), reliability of the scores was remarkably consistent 
across groups. 

Reliability across ESL and English readers 

For the interrater reliability estimates presented in Tables V-2 and V-5,
score 1 was the first score assigned to the writing sample and score 2 was the
second score assigned. Score 1 could be assigned by either an ESL reader or a
regular English teacher reader, with score 2 then coming from a rater in the
other group. For the analyses in this section, interrater reliabilities were
recalculated so that the first score in each pair was the score assigned by
the ESL reader and the second score was assigned by the English teacher
reader. If ESL readers assigned scores that were systematically higher or
lower than the English teacher readers, then the recalculated interrater
reliabilities could be higher than the originally calculated reliabilities.
However, this was not the case. As is evident in Tables V-8 and V-9, the mean
scores assigned by the two types of readers were nearly identical, and the
interrater correlations were very similar to those reported in Tables V-2 and
V-5. Thus, the careful training procedures employed in this study were
sufficient to overcome any differences in rating strategies between the two
types of readers that might otherwise have occurred.
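
The reasoning behind the recalculation can be illustrated with a small simulation. This is a sketch under invented assumptions, not a reanalysis of the study data: paper quality and reader error are drawn from normal distributions, and `offset` is a hypothetical systematic difference between the two reader types. When reader type is mixed across the score 1 and score 2 positions, such an offset acts as extra noise and lowers the correlation; when scores are paired by reader type, the offset only shifts one mean and leaves the correlation untouched.

```python
import random


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(offset, n_papers=5000, noise_sd=0.7, seed=0):
    """Return (r with reader type mixed across positions,
               r with scores paired as ESL vs. English)."""
    rng = random.Random(seed)
    esl, eng, first, second = [], [], [], []
    for _ in range(n_papers):
        quality = rng.gauss(0.0, 1.0)  # hypothetical true paper quality
        esl_score = quality + rng.gauss(0.0, noise_sd) + offset
        eng_score = quality + rng.gauss(0.0, noise_sd)
        esl.append(esl_score)
        eng.append(eng_score)
        # Randomly decide which reader type supplied "score 1".
        if rng.random() < 0.5:
            first.append(esl_score)
            second.append(eng_score)
        else:
            first.append(eng_score)
            second.append(esl_score)
    return pearson(first, second), pearson(esl, eng)


mixed_r, by_type_r = simulate(offset=1.0)
no_offset_mixed_r, no_offset_by_type_r = simulate(offset=0.0)
print(f"offset 1.0: mixed r = {mixed_r:.2f}, by reader type r = {by_type_r:.2f}")
print(f"offset 0.0: mixed r = {no_offset_mixed_r:.2f}, "
      f"by reader type r = {no_offset_by_type_r:.2f}")
```

In the simulated case with no offset, the two correlations are nearly the same, which is the pattern the text reports for the actual ratings.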
Table V-8

Means, Standard Deviations, and Correlations for Holistic Ratings
by ESL and English Teacher Raters
(N = 638)

Topics                            Reader     M      SD     r

Compare/Contrast - Space          ESL        3.3    1.4    .67
                                  English    3.2    1.4

Compare/Contrast - Recreation     ESL        3.5    1.4    .70
                                  English    3.4    1.3

Chart/Graph - Continents          ESL        3.4    1.3    .67
                                  English    3.1    1.3

Chart/Graph - Farming             ESL        3.3    1.4    .72
                                  English    3.4    1.4

Table V-9

Means, Standard Deviations, and Correlations for
Ratings by ESL and English Teacher Raters
(N = 238)

Topics                        Reader     M      SD     r

Space - Discourse level       ESL        3.5    1.5    .65
                              English    3.5    1

Farming - Discourse level     ESL        3.4    1.     .70
                              English    3.5    1.3

Space - Sentence level        ESL        3.2    1.5    .71
                              English    3.0    1.5

Farming - Sentence level      ESL        3.3    1.5    .72
                              English    3.1    1.4
Reader Responses to Weekend Reading Questionnaires

To obtain information about the points of view held by readers with
regard to the evaluation of writing skills and their exposure to different
methods of scoring papers on the same topics, the readers were asked to
respond to two questionnaires during the essay reading weekend. The first
questionnaire (Saturday) was completed at the conclusion of the holistic
scoring session. The second questionnaire (Sunday) was completed at the
conclusion of the discourse/sentence scoring session.

The first section of each questionnaire consisted of a checklist of
identical features relevant to the evaluation of written assignments. On
the Saturday questionnaire, readers were asked to rate the degree of
importance they attribute to the 13 features of written assignments in
their actual practice outside of formal reading sessions (e.g., in the
classroom or writing center). On the second page of the Saturday
questionnaire, they also were asked to rate the degree of importance they
attributed to the same features during the holistic readings. On the
Sunday questionnaire, the same checklist was repeated, on which readers
rated the features with regard to degree of importance attributed to the
features during the discourse/sentence readings. The ratings reported by
all readers who completed the questionnaires, including chief readers and
table leaders, appear in Table 5. Some of the most salient responses
indicated the following:

o The readers assigned high importance ratings (5) to some features
they perceived they attended to, both prior to the readings and
during the discourse/sentence readings: mastery of the conventions
of grammar, quality of sentence structure, quality of paragraph
organization, and addressing the topic. The means for these
features reflect a similar pattern, although with somewhat more
importance given to the features either prior to the readings or
during the discourse/sentence readings. These responses suggest
that the readers felt that certain features they regard as
significant in practice are features to which they felt they had
attributed significance during the discourse/sentence readings.
In fact, their subjective reactions during discussions and
conversations suggested that they perceived the discourse/sentence
scoring to be more "realistic." The readers may have perceived
that they were evaluating the features of the papers somewhat
differently during the holistic scoring and the discourse/sentence
scoring; however, the means and standard deviations of the scores
for the papers do not indicate that the different scoring methods
resulted in different levels of scores for the papers.

o The readers rated the feature, quality of overall paper organization,
to be of greater importance during the discourse/sentence
scoring than in other instances. This perception may have
resulted from the division of ratings in the two-score method, in
which discourse-level characteristics were evaluated separately
from sentence-level characteristics. Overall paper organization
was likely to be one of the discourse-level characteristics on
which readers focused, although holistic scoring also places
considerable emphasis on this feature.

o Finally, some features received higher importance ratings for
the evaluation of papers prior to the reading sessions: quality
of content, development of ideas, adopting a tone ... appropriate
to the audience, and appropriately meeting assignment requirements.
These features, of course, justifiably are of more
importance to classroom assignments. The ratings for these
particular features probably also reflect the appreciation, which
emerged during the training discussions, that the readers should
evaluate the quality of the papers within the context in which
they were written (e.g., time limits, possible lack of familiarity
with the content of the topic, cross-cultural differences in
presenting ideas and communicating tone).

Tables 6, 7, and 8 present the reader responses to the same sections of
the questionnaires, criteria used to evaluate written assignments, with a
breakdown comparing all readers, ESL readers, and English readers, for
their ratings based on perceptions of the features prior to the reading,
during the holistic scoring, and during the discourse/sentence scoring,
respectively. When the perceptions of the importance of these features to
ESL and English readers are compared, essentially no differences appear, as
reported in any of the three tables.

Tables 9, 10, and 11 summarize the readers' responses to the questions
on the two final sections of the questionnaires that focused on the scoring
methods. Table 9 compares responses to the same questions answered both
after the holistic scoring and after the discourse/sentence scoring,
supplied by all readers. Questions 6 and 7 appeared only on the Sunday
questionnaire, since they asked readers to compare the two scoring methods.
The responses indicate the following:

o Many readers (70 percent) felt that holistic scoring can be used
appropriately in the classroom, but only 57 percent responded
positively to the use of discourse/sentence scoring in the
classroom.

o A considerable number of readers (80 percent) felt that the
scores they were asked to assign during both scoring sessions
were appropriate for the particular sample of papers.

o A large percentage of readers (60 percent) felt that it was
possible to make clear distinctions between the papers at adjacent
score intervals during the holistic scoring; however, fewer
readers (45 percent) were comfortable with the discourse/sentence
scoring in this regard. Many readers also informally reported the
latter reaction to us.
o Only half of the readers felt that it might be possible to assign
descriptions to each of the score intervals used during both the
holistic and discourse/sentence scoring. The comments of many
readers, both informally and as reported in their comments on the
questionnaires, indicated that they would feel uncomfortable
attempting to assign descriptions to the score levels because
individual papers at one score level can differ considerably but
deserve an equivalent rating. The reader comments indicated,
however, that sample papers at each score level could be useful
and meaningful, both to other readers and to those who would
interpret writing sample scores.

o Questions 6 and 7, asked on the Sunday questionnaire, reflect
the readers' generally positive attitude toward the scoring
methods that had been applied to the papers.

Tables 10 and 11 provide the breakdowns to the same questions,
comparing the responses of all readers, ESL readers, and English readers,
on the Saturday and Sunday questionnaires, respectively. The reactions of
the ESL and English readers do not appear to differ and reflect the same
pattern of responses as summarized above.



Correlations of Holistic Scores with Ratings of Subject Matter Experts

As noted above, the ratings of English teachers and ESL teachers agreed
very well. But would the judgments of professors in the substantive areas of
social sciences and engineering agree with the judgments of the ESL and
regular English teachers, especially if no special training for the
professors was conducted? Each of four social science professors and each
of four engineering professors rated (on a 1-6 scale) 90 writing samples on
the Space topic. Judgments over the four professors were averaged to form
a mean social science judgment; similarly, the ratings of the four
engineering professors formed a mean engineering judgment. The mean social
science judgment correlated .86 with the holistic score that had been
assigned during the regular scoring session, and the mean engineering
judgment correlated .92 with the holistic score. The social science
judgment and the engineering judgment correlated .92. A similar pattern
was observed for the sample of 93 essays on the Farming topic that were
rated by the subject matter professors. The mean social science judgment
correlated .83 with the holistic score, and the engineering judgment
correlated .82. The intercorrelation of the engineering and social science
judgments was .92. Whatever differences in the perception of good writing
may exist among regular English teachers, ESL teachers, social science
teachers, and engineering teachers, these differences do not interfere with
the ability of these diverse groups to rank students' writing samples in
the same order.

The judges also were asked, after rating each set of papers on one
topic, to indicate the rating that reflects the minimal level of writing
competence acceptable for beginning students in their departments. For the
Space topic, four judges indicated that a rating of 4 would be acceptable,
and two judges indicated acceptability ratings of 3 and 5. For the Farming
topic, most judges (six) found a rating of 4 to be acceptable, with one
judge indicating a 3, and the other judge, a 2.
Exploratory and Confirmatory Factor Analyses

A series of principal axes factor analyses with varimax rotations were
conducted to generate hypotheses about the factor structure of the data.
The data that were factor analyzed initially consisted of the correlation
matrix of the 11 variables that represented complete data for the majority
of the subjects in the sample: scores on the three sections of the TOEFL,
holistic scores assigned to papers on each of the four topics, and
discourse/sentence scores assigned to papers on each of two topics. These
analyses were conducted for the total sample of subjects (560) and for each
of the three non-English-language groups, Arabic (139), Chinese (230), and
Spanish (191). Several factor analyses (principal components) were
conducted using the 11 variables. However, because high correlations
between the holistic scores and the discourse/sentence scores indicated
that the discourse/sentence scores did not represent independent
information, they were omitted from the analysis. Thus, the final factor
analysis consisted of the four holistic scores and the three TOEFL scores.
The different analyses indicated that the data were not likely to yield
more than three factors. The factor analyses of the seven variables
suggested that two factors appeared to achieve a more satisfactory fit to
the data. The two-factor varimax solution for the total sample accounted
for 77 percent of the total variance. Subsequently, a promax factor
analysis with oblique rotations was conducted, using the same data; this
analysis suggested that the two factors were substantially correlated, but
the promax factors did not achieve a better fit to the data than the
varimax factor analysis.

The two-factor varimax solution resulted in what appear to be "method"
factors (Table 12). One factor consists of the scores on the three
sections of the TOEFL and the other factor of holistic scores on the four
topics. The factors obtained for each non-English language group are
presented in Tables 13 to 15. For this sample of data, the method of
assessment appears to influence performance more strongly than the four
different writing sample topics or types (compare/contrast and
chart/graph) or than the three modes of English proficiency measured by the
TOEFL (listening comprehension, writing ability, reading comprehension).
One interpretation suggests that performance on measures of English
language proficiency becomes more differentiated when English proficiency
measures require a candidate to respond by applying different cognitive
processes: recognition vs. production.

Because the scores on the variables (TOEFL and holistic writing sample
scores) were highly correlated (Table 16), the question still remained
whether a two-factor solution achieved a better fit to the data than a
one-factor solution. A maximum likelihood factor analysis (LISREL) was
conducted to determine whether the two factors would reproduce the original
variance/covariance matrix. This method permits (in fact, requires)
specification of a factor model of the domain to be analyzed and provides
a significance test to indicate how well the model fits the data. These
features of the analysis provide a rational and statistical basis for
choosing the most appropriate solution from among reasonable alternatives.
The two-factor model was specified for the seven variables in the
principal axes analysis. The model is revised, as necessary, on the basis
of residual correlations among variables to see if a more satisfactory fit
to the data can be obtained.

We limited attention to factor models having a "simple structure,"
allowing each score to contribute to the definition of only one factor.
The first analysis (LISREL) held that the pattern of loading was
invariant. This analysis showed that the goodness of fit to the two-factor
solution is high (mean index of .93 over the three language groups), with
a low root mean squared residual (mean of .24) that indicates most of the
observed covariances in each population are explained by the two-factor
model. When summed across the three language group populations, the
chi-square (42.50 with 39 df) did not reject the hypothesized two-factor
solution. Next, a one-factor model was tried, but this model did not fit
the data (chi-square = 215.58 with 42 df). Although the one-factor model
fit the data for the Spanish group reasonably well, it did not fit for
either the Arabic or Chinese groups.
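
The two chi-square results can be turned into approximate p-values with the standard library alone. The sketch below uses the Wilson-Hilferty normal approximation to the chi-square distribution rather than an exact routine, and the function name `chi2_sf_wilson_hilferty` is hypothetical; it is meant only to show why the first statistic does not reject the model while the second does, not to reproduce the LISREL output.

```python
import math


def chi2_sf_wilson_hilferty(chi2: float, df: int) -> float:
    """Approximate upper-tail p-value of a chi-square statistic using the
    Wilson-Hilferty cube-root normal approximation."""
    k = 2.0 / (9.0 * df)
    z = ((chi2 / df) ** (1.0 / 3.0) - (1.0 - k)) / math.sqrt(k)
    # Upper tail of the standard normal distribution at z.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Chi-square values and degrees of freedom as reported in the text.
p_two_factor = chi2_sf_wilson_hilferty(42.50, 39)
p_one_factor = chi2_sf_wilson_hilferty(215.58, 42)

print(f"two-factor model: p is roughly {p_two_factor:.2f}")
print(f"one-factor model: p is roughly {p_one_factor:.1e}")
```

Under this approximation the two-factor statistic falls well inside the range expected by chance for 39 degrees of freedom, whereas the one-factor statistic is far beyond any conventional rejection threshold.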

The second analysis (LISREL) assumed not only the same pattern of 
loadings but also that comparable loadings are equal. This solution was 
rejected. Again, the solution fit the Spanish group the best, but did not 
fit the other groups well. Taken together, the two LISREL analyses
demonstrated that, for the two-factor solution, the patterns are the same for
the three language groups, but the individual loadings on each factor may
differ for each language group.



Relationships of Writing Sample and TOEFL Mean Scores 

The model obtained by the factor analyses can be interpreted further by
studying the correlational relationships between the variables contributing to
the two factors and other test score variables investigated in this study.

Mean writing sample scores and TOEFL scores for the foreign samples are
presented in Table 17. To facilitate cross-task comparisons, only subjects
with complete data on both the writing samples and the TOEFL were included in
these analyses. Writing sample scores are reported on a 1-6 scale (averaged
over two raters); TOEFL scores are the standardized scores normally reported.
The TOEFL total scores (reported in the last column of Table 17) may be
compared with normative data for the same language groups as reported in the
TOEFL manual (1983). The manual reports TOEFL scores of 463, 503, and 504
for Arabic, Chinese, and Spanish speaking groups, respectively. Thus, in each
language group, the sample for the current study is above average, as
discussed in a previous section.



Means for the three countries contributing the majority of the Spanish 
speaking subjects for the study were higher. Means for Chile, Mexico, and 
Peru were 520, 514, and 513, respectively. 
The pattern of means across the three language groups is highly
consistent; for every writing sample score and every TOEFL score the Arabic
sample is the lowest, the Chinese sample is in the middle, and the Spanish
sample is highest. This lack of major interaction between type of score
(writing sample or multiple-choice) and language group is consistent with the
notion that both types of scores may be assessing, to a large extent, the same
underlying language proficiency dimension. Nevertheless, there is some
evidence that the between-groups differences are smaller for the essays than
for the TOEFL.

To put both measures on a comparable scale, the writing sample
holistic total scores and the TOEFL total scores were each separately
standardized (z-transformed) using the mean and standard deviation of the
total group for each measure. The resultant z-scores for each group are
presented in Table V-9.

Table V-9

Standardized Writing Sample Essay and TOEFL Scores

                           Arabic    Chinese    Spanish

Holistic Writing Sample    -.298     -.111      .380

TOEFL Total                -.636     -.067      .416

The relative positions of the Chinese and Spanish groups are essentially
the same on both measures, but the Arabic sample is relatively lower on the
TOEFL than on the essays. Readers who are more comfortable with
percentiles should note that the Arabic group is at the 38th percentile on
the writing sample but at only the 27th percentile on the TOEFL. In
general, if a highly reliable measure shows some differences among groups,
a less reliable measure would be expected to show less difference.
However, the reliability of the writing sample total score is sufficiently
high that reliability alone is probably not sufficient to explain the
observed differences. Assuming that more than a statistical artifact is
involved, either the TOEFL may differentially assess the language
proficiency of Arabic speakers or the writing samples may be biased in
favor of Arabic speakers. Further research relating both measures to
external criteria is needed.
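
The translation from z-scores to percentiles can be sketched as follows. The sketch assumes that scores in the total group are approximately normally distributed, which the source does not state, and the function name `normal_percentile` is hypothetical; the z-scores are the Arabic group values from the table above.

```python
import math


def normal_percentile(z: float) -> float:
    """Percentile rank of a z-score under a standard normal distribution."""
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Arabic group z-scores from the standardized score table above.
arabic_z = {
    "Holistic writing sample": -0.298,
    "TOEFL total": -0.636,
}

percentiles = {name: normal_percentile(z) for name, z in arabic_z.items()}
for name, pct in percentiles.items():
    print(f"{name}: about percentile {pct:.0f}")
```

Under the normality assumption this gives roughly the 38th percentile for the writing sample and about the 26th for the TOEFL, close to the figures cited in the text.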

Relationship of Demographic Variables to Writing Sample and TOEFL Scores

Correlations were computed between the demographic variables and the
writing sample scores and TOEFL scores in order to identify which demographic
variables are significantly related to the criterion scores. The demographic
variables considered were age, sex (M = 0, F = 1), undergraduate vs. graduate
applicant (undergraduate = 1, graduate = 0), business vs. other graduate majors
(business = 1, other = 0), hard science/engineering vs. other graduate majors,
social science vs. other majors, and self-reported number of years spent
studying English. The statistically significant correlations are summarized
in Table 18 for the full sample of international candidates, and in Tables 19,
20, and 21 for the Arabic, Chinese, and Spanish samples, respectively. For
each language group there were 91 (7 demographic variables x 13 criterion
measures) correlations; thus some of the "significant" correlations may in
fact be chance occurrences. Even if truly statistically significant,
correlations below .25 indicate that so little of the criterion variance is
explained that they have almost no practical significance.
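
The two cautions in this paragraph amount to simple arithmetic, sketched below. The significance level of .05 is an assumption (the source does not state the level used), and the probability calculation treats the 91 tests as independent, which they are not, so it should be read only as a rough indication.

```python
# Number of correlations examined per language group, as stated in the text.
n_demographic = 7
n_criterion = 13
n_tests = n_demographic * n_criterion  # 91

# Assumed significance level; the source does not state the level used.
alpha = 0.05

# Expected number of "significant" correlations if every true correlation
# were zero.
expected_false_positives = n_tests * alpha

# Chance of at least one spurious result, treating the tests as independent
# (a simplification, since the criterion measures are intercorrelated).
prob_at_least_one = 1.0 - (1.0 - alpha) ** n_tests

# Proportion of criterion variance explained by a correlation of .25.
variance_explained = 0.25 ** 2

print(f"tests per group: {n_tests}")
print(f"expected chance findings at alpha = {alpha}: {expected_false_positives:.2f}")
print(f"probability of at least one chance finding: {prob_at_least_one:.2f}")
print(f"variance explained by r = .25: {variance_explained:.1%}")
```

On these assumptions, four or five chance findings per language group would be unsurprising, and a correlation of .25 accounts for only about 6 percent of the criterion variance.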

Across all three samples, number of years of studying English is the one
variable that is consistently related to all the criterion scores. Note in
particular that, in each sample, the correlation with the holistic total is
very similar to the correlation with the TOEFL total, indicating that years of
study of English has approximately an equal impact on both methods of
assessing English competence.

The correlations in the Chinese sample must be interpreted cautiously
because of the split in that sample between Taiwan and Hong Kong. Most of the
undergraduates came from Hong Kong while most graduates came from Taiwan.
Thus, the higher scores for undergraduates (positive correlation with
undergraduate status variables as well as the negative correlations with age)
may be an artifact related to generally higher English competence in the
British colony than in Taiwan. However, note that the higher scores for
undergraduate/Hong Kong students were found consistently on the writing sample
scores but not on any of the TOEFL scores. This may reflect a relatively
greater emphasis on written communication skills in Hong Kong or a greater
emphasis on TOEFL preparation in Taiwan.

In the Arabic and Spanish samples, there was a slight trend for the
graduates to score higher than the undergraduates, but this trend is more
remarkable for how small it is than for the few correlations that are
statistically significant.

Although there were a few significant correlations between major field
designations and some of the criterion scores, there was no evidence that the
writing sample or the TOEFL was more sensitive to major field differences.
Thus, except for the differential sensitivity of the writing sample for the
Chinese group noted above, the available evidence suggests that the writing
sample and the TOEFL are comparably sensitive to differences in age, sex,
undergraduate-graduate status, graduate major, and number of years of studying
English.
Correlational Analyses

The correlations between the scores on the various measures provide
additional information regarding the validity of the TOEFL and GRE General
Test scores for this sample of candidates. This section reports the
correlations between TOEFL scores and direct measures of writing (writing
samples), correlations between GRE General Test scores and direct and indirect
measures of writing, and correlations between the scores on the direct
measures obtained by the different scoring methods (holistic, discourse/
sentence, and subject matter). The final section describes the data obtained
on the Writer's Workbench.

Correlations with TOEFL Scores 



Intercorrelations of the various writing sample scores with TOEFL scores
are presented in Table 16. Consistent with the previous discussion of the
lack of differentiation between holistic scores and discourse/sentence scores
and between the two topic types, the correlations of each writing sample score
with a given TOEFL score were essentially identical. Because it is the most
reliable score, the total holistic score correlated most highly with the
TOEFL. The correlation of .72 between the holistic total and the TOEFL total
indicates that the two measures are largely overlapping, but that the overlap
is not perfect. Because of the high reliabilities of both the writing sample
holistic total (about .90) and the TOEFL total (about .95), correcting the
correlation for attenuation does not substantially alter the conclusion
(corrected correlation of TOEFL and holistic total = .78). The writing sample
is measuring some component of English proficiency that is not assessed by the
TOEFL. To better understand the degree of overlap or independence, note that
the correlation between the holistic total and TOEFL structure and written
expression (.69) is just about the same as the correlation of TOEFL listening
comprehension and TOEFL structure and written expression (.6S). Thus, if the
writing sample were a fourth section of the TOEFL, its relationship with the
other measures would be consistent with the degree of relationship among
sections observed in the present test. It is important to note that the
writing sample measures some higher order organizational skills that even
native speakers may find difficult. Thus, the writing sample should be
expected to tap some skills that are well beyond the minimal proficiency level
emphasized in the TOEFL.
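
The correction for attenuation cited above divides the observed correlation by the square root of the product of the two reliabilities. A minimal sketch, using the approximate reliabilities given in the text (the function name `correct_for_attenuation` is hypothetical):

```python
import math


def correct_for_attenuation(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Estimate the correlation between two measures if both were
    perfectly reliable."""
    return r_xy / math.sqrt(rel_x * rel_y)


# Observed correlation and approximate reliabilities quoted in the text.
observed_r = 0.72           # holistic total vs. TOEFL total
holistic_reliability = 0.90
toefl_reliability = 0.95

corrected = correct_for_attenuation(observed_r, holistic_reliability,
                                    toefl_reliability)
print(f"corrected correlation: {corrected:.2f}")
```

Because both reliabilities are already high, the correction raises the correlation only from .72 to about .78, which is why the conclusion of imperfect overlap is not altered.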

Although the discourse and sentence scores were initially conceived as two
separate scores that could not be summed, the high correlation between them
suggested that a sum score, with its increased variance, might correlate more
highly with the TOEFL total. Therefore, an additional score was created by
adding the discourse total (sum of the discourse scores over two raters on
each of the two topics, Space and Farming) to the sentence total. This
discourse/sentence total correlated .73 with the TOEFL total, a value nearly
identical to the correlation of the holistic total with the TOEFL total (.72).
But the discourse/sentence total was based on scores from only two essays; a
holistic total based on holistic scores from only those two essays correlated
.68 with the TOEFL total. Thus, the discourse/sentence total appears to yield
slightly better predictions. Additional research is needed to fully explain
this apparent advantage. It may result simply from the increased variance of
the discourse/sentence total; using an expanded score scale for the holistic
judgments might make both scores more comparable. An additional consideration
is the possibility of training effects. In this study all discourse/sentence
scores were assigned on the second day of a two-day scoring session, with
holistic scores having been assigned on the first day. Judges were therefore
more familiar with the range of student responses during the
discourse/sentence scoring than they had been during the holistic scoring.
Future research should counterbalance the order effect or, preferably, use
totally different groups of judges for the two kinds of scoring.



Intercorrelations of the various scores on the direct measures of writing
ability (writing samples) and indirect measure of writing ability (LSAT
writing test) appear in Table 22. For the foreign sample, the correlations of
scores on the writing sample with TOEFL total scores and GRE verbal scores
are nearly identical; scores on the writing sample could be predicted equally
well from GRE verbal or TOEFL scores. The correlation of writing sample
scores with GRE verbal scores is substantially higher in the total sample than
in the foreign sample because the United States students scored relatively
high on both measures. The moderately high correlations of all writing sample
scores with scores on GRE analytical suggest the contribution made by English
(verbal) proficiency to the analytical section; however, this correlation,
when compared to the correlation of the writing sample scores with GRE verbal,
also suggests that this section assesses an ability other than one that is
purely verbal. The low negative correlations of the writing sample scores
with GRE quantitative scores reflect a pattern that further reinforces the
independence of quantitative scores from verbal and analytical scores. These
relationships are reinforced by the correlations of GRE verbal scores with
GRE analytical scores (.62) and with GRE quantitative (^.l?) scores, as well
as by the correlations of GRE analytical scores with GRE quantitative scores
(.33). Because this sample of data consists preponderantly of foreign
students, the GRE General Test scores present remarkably stable patterns of
relationships. It should be noted that the negative correlations with the
GRE quantitative score are an artifact of a foreign sample in which candidates
with very low GRE verbal scores may still seek admission if their GRE
quantitative scores are very high. In the sample of United States students,
GRE verbal and GRE quantitative were correlated .64.
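
The observation that the writing sample/GRE verbal correlation is higher in
the total sample than in the foreign sample can be illustrated with a small
simulation. The sketch below is not a reanalysis of the study data: the group
sizes, means, and within-group correlation are hypothetical values chosen
only to show how pooling a group that scores high on both measures raises the
overall correlation.

```python
import numpy as np


def simulate_group(rng, n, mean, r):
    """Draw n (x, y) pairs with a common mean, unit variances, correlation r."""
    cov = [[1.0, r], [r, 1.0]]
    return rng.multivariate_normal([mean, mean], cov, size=n)


rng = np.random.default_rng(1)

# Hypothetical groups: the same within-group correlation, but the second
# group scores about two standard deviations higher on both measures.
foreign = simulate_group(rng, n=500, mean=0.0, r=0.4)
us = simulate_group(rng, n=100, mean=2.0, r=0.4)
pooled = np.vstack([foreign, us])

r_foreign = np.corrcoef(foreign[:, 0], foreign[:, 1])[0, 1]
r_pooled = np.corrcoef(pooled[:, 0], pooled[:, 1])[0, 1]
print(f"within foreign group: {r_foreign:.2f}   pooled sample: {r_pooled:.2f}")
```

The difference in group means adds between-group covariance to the pooled
sample, so the pooled correlation exceeds the within-group value even though
the relationship inside each group is unchanged.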

For the foreign student sample (N = 124), the correlations of scores on
the TOEFL with scores on the GRE General Test, although slightly lower, repeat
the same general patterns. GRE verbal scores are more highly correlated with
TOEFL scores, particularly with Section III (Reading Comprehension) TOEFL
scores (.72) and with the total TOEFL score. These correlations support the
relationship of the GRE verbal to the TOEFL as measures of English language
proficiency. Since the GRE verbal items place emphasis on reading
comprehension, the correlation with Section III of the TOEFL provides further
evidence that reading comprehension contributes substantially to the verbal
scores. The low correlations of TOEFL scores with GRE quantitative scores
repeat the pattern observed previously, as do the moderate correlations of
TOEFL scores with the GRE analytical scores. These TOEFL/GRE correlations,
when compared with the correlations of the three sections of the TOEFL,
clearly indicate that the TOEFL assesses English proficiency overall, but that
each section of the TOEFL contributes a somewhat different measure of that
proficiency.

Finally, for the United States student sample (N = 43), the scores on the
indirect measure of writing ability present some interesting patterns. The
high correlations are those of the GRE verbal scores with the sections of the
LSAT writing test, whereas the lowest correlation is that between scores on
the usage section and the GRE quantitative score. The correlations of the
quantitative and analytical sections of the GRE with the scores on the
sentence correction section and the total score on this indirect measure are
low. The correlations of the scores on the writing samples with the LSAT
writing test are not as high as would be expected, the highest being the
correlation between scores on the sentence correction section of the LSAT
writing test and the total holistic scores (only .51). However, correlations
may have been attenuated by the ceiling-level writing sample performance of
many of the United States students. Although the students in the United
States sample performed very well on the writing samples, their scores on the
indirect measure of writing ability did not reflect the same degree of
writing competence. The high correlation of scores on GRE verbal, an
indirect measure, with one section of the indirect measure of writing ability,
when compared to the correlations of writing sample and LSAT writing test
scores, suggests that the method of assessment (direct vs. indirect) may
elicit different levels of performance. This difference in performance on
direct and indirect measures of writing ability, although they may assess some
overlapping abilities, was indicated in the two-factor solution to the TOEFL
and writing sample scores discussed in a previous section of this chapter.
The score differences reflect level differences, but also may be influenced by
the kinds of writing abilities that are elicited by the direct and indirect
measures, a question that would require further research.

An additional analysis focused on the correlations of writing sample
scores with item types, or item parcels, in the GRE General Test. The sample
of subjects was reduced to restrict the analysis to the three forms of the GRE
that were taken by most of the candidates in our sample, thus eliminating
small numbers of subjects who took other forms of the test. This sample of
132 subjects consisted of 21 candidates for whom English is their primary
language, 5 Arabic-language candidates, 73 Chinese-language candidates, and 33
Spanish-language candidates. The GRE score data were retrieved, and separate
scores were obtained for the different item types that make up the test.
These scores were correlated with the total holistic score, averaged over four
writing samples, on the direct measures of writing ability (Table 23). The
observed pattern of correlations was consistent with the relationships
reported in other GRE studies. Specifically, the analytical reasoning and
logical reasoning scores were not highly correlated (.24), and the analytical
reasoning items were more highly correlated with the quantitative items (.46,
.35, .50) than were the logical reasoning items (-.09, -.18, .02). On the
other hand, the logical reasoning items were more highly correlated with the
verbal items (.65, .50, .67) than were the analytical reasoning items (•IS,
.17, .24). The holistic scores were more highly correlated (•SA) with the
logical reasoning items and with the three types of verbal items (.68, .67,
.70) than with the analytical reasoning items (.23). This result indicates
that the holistic scores, as expected, reflect verbal ability, as also is
reflected in the logical reasoning items.

Table 24 reports the results of a stepwise regression analysis of these
data, which parallel the correlational analysis. The prediction of the total
writing sample score is enhanced somewhat by the addition of scores on the two
types of verbal items (reading comprehension and the discretes, that is,
antonyms and analogies) and next by scores on logical reasoning items and
verbal sentence completion items. The quantitative item types, as well as the
analytical reasoning items, do not contribute substantially to the holistic
score.
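
For readers unfamiliar with stepwise regression, the following is a minimal
sketch of forward selection. It is not the procedure used to produce Table 24:
the data are synthetic, the predictor names are illustrative labels, and the
stopping rule (a minimum gain in R^2, `min_gain`) is an assumption made for
simplicity; statistical packages typically use an F-to-enter criterion
instead. The sample size of 132 simply mirrors the reduced sample described
above.

```python
import numpy as np


def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


def forward_stepwise(X, y, names, min_gain=0.01):
    """Add predictors one at a time, each time taking the largest R^2 gain.

    Stops when the best remaining predictor adds less than min_gain.
    Returns a list of (predictor name, cumulative R^2) in order of entry.
    """
    chosen, steps, best = [], [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        fits = {j: r_squared(X[:, chosen + [j]], y) for j in remaining}
        j_best = max(fits, key=fits.get)
        if fits[j_best] - best < min_gain:
            break
        best = fits[j_best]
        chosen.append(j_best)
        remaining.remove(j_best)
        steps.append((names[j_best], best))
    return steps


# Synthetic illustration only: the criterion depends on the first two
# predictors and not at all on the third.
rng = np.random.default_rng(0)
n = 132
names = ["reading_comprehension", "logical_reasoning", "quantitative"]
X = rng.normal(size=(n, 3))
y = 0.7 * X[:, 0] + 0.4 * X[:, 1] + 0.4 * rng.normal(size=n)

steps = forward_stepwise(X, y, names)
for name, r2 in steps:
    print(f"{name:<24s} cumulative R^2 = {r2:.2f}")
```

A predictor that enters late, or not at all, is not necessarily unrelated to
the criterion; it may simply be redundant with predictors already in the
equation.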

Writer's Workbench Analyses

The Writer's Workbench, in addition to serving as a tool for editing and
instruction, appears to have promise as a research tool. The relationships of
features of writing identified on the Workbench with other approaches to
evaluating the features of a writing sample (e.g., holistic scores, error
analyses) provide somewhat detailed evidence about these features. The data
analyzed on the Workbench for this study suggest that certain characteristics
of writing that are attended to by a human reader are related to, and
therefore are likely to have influenced, the evaluation of a piece of writing.
These data provide some interesting clues, which need to be investigated with
further research. The results provide additional information about the
features of writing that readers may not be conscious of, but that may
contribute to a score. This observation is parallel to the experience of the
readers who sensed that they were attending to somewhat different features of
writing when applying the discourse/sentence method of scoring and would have
expected the discourse/sentence and holistic scoring methods to yield
different scores.

Tables 25 through 29 summarize the correlational relationships between the
various Workbench features and TOEFL and writing sample scores. The data
consist of four sets of writing samples that were selected to be
representative on the basis of the range of holistic scores assigned to the
different subsamples of the total sample (Arabic, Chinese, Spanish, English,
graduate, undergraduate, hard science, and social science). These small
samples appear to be representative, in that they reflect the same
relationships that were observed for the total sample with respect to scores
on the writing samples and sections of the TOEFL. The correlations of the
data on the Workbench features show one pattern that confirms the validity of
the TOEFL and the writing sample scores as English language proficiency
measures: the highest correlations obtained are with the TOEFL and the
holistic and discourse/sentence scores (only the Farming and Space topics
received discourse/sentence scores). The scores on the writing samples
yielded additional, relatively high correlations that were not found with the
TOEFL. Characteristics such as "number of words," "number of content words,"
"number of short sentences," and "number of 'to be' verbs" (comparing Tables
25 through 29) have moderate correlations with the writing sample scores.

Scores on papers written on the different topics also yielded significant,
though moderate, correlations with somewhat different Workbench variables. For
example, some significant correlations of holistic scores with Writer's
Workbench features obtained for three of the topics were not observed for the
Continents topic (Table 28): "number of short sentences," "number of long
sentences," "percentage of passives," "number of content words," "percentage
of prepositions," and "percentage of conjunctions." With a larger sample of
papers, it would be worthwhile to investigate whether patterns of correlations
of Workbench scores with writing sample scores differ in significant ways for
papers written on different topics and in different discourse modes. Since
the readers exhibited high agreement across topics and topic types, the
potential finding that differential features contributed to the same numerical
ratings would be of interest, because it would suggest that readers are able
to adjust their standards to account for different features of writing
elicited by different topics or modes of discourse.

These correlations should be viewed only as descriptions of the
relationships observed within four discrete sets of data, however. Since a
large number of variables were correlated and a small number of representative
samples of papers were analyzed for each of the four topics, these
correlations may have resulted largely from chance factors. Before any
inferences or conclusions can be drawn, this study requires replication with
larger samples of papers and a possible reduction in the number of variables.
We are reporting these data because they suggest some interesting
relationships among features of the papers and the scores assigned to the
papers, relationships that warrant additional exploration. Thus the results
can be regarded as descriptive of only these particular sets of data: papers
written in response to four specific topics, subjected to specific scoring
procedures and systems, and within the context of the parameters of this
research study.
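
The caution about chance factors can be made concrete with a rough
calculation. The sketch below assumes independent significance tests at a
fixed alpha, which the Workbench correlations are not (many features overlap),
and the count of 150 correlations is a hypothetical figure rather than the
number actually computed in this study.

```python
def expected_chance_hits(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of 'significant' results when no true relationship exists."""
    return n_tests * alpha


def prob_at_least_one(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of one or more chance hits, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


# Hypothetical: 30 Workbench features correlated with 5 score variables.
n_tests = 30 * 5
print(f"expected chance hits: {expected_chance_hits(n_tests):.1f}")
print(f"P(at least one chance hit): {prob_at_least_one(n_tests):.4f}")
```

Even under these simplified assumptions, several nominally significant
correlations would be expected by chance alone, which is why replication with
larger samples and fewer variables is needed before conclusions are drawn.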

None of the individual features analyzed by the Writer's Workbench is
highly correlated with TOEFL section scores or writing sample scores (holistic
or discourse/sentence); therefore a specific Workbench feature would not serve
as a predictor of TOEFL section scores or of writing sample scores. Instead,
the separate features of papers obtained from the Workbench system tend to
support the notion that several different features contribute to the quality
of a writing sample. Tables 30 through 32 report stepwise regression analyses
conducted on the Writer's Workbench, Section II of the TOEFL (structure and
written expression), and writing sample score variables for these sets of
data.

For these analyses, we reduced the number of Writer's Workbench features
by eliminating features that introduced redundancy because they were highly
correlated. For example, "percentage of short sentences" was dropped, but
"number of short sentences" was retained. Tables 30 through 32 list, for each
topic, the Writer's Workbench features that would contribute to the prediction
of TOEFL Section II scores (Table 30), the holistic scores (Table 31),
and the discourse/sentence scores (Table 32). Some Writer's Workbench
features, which represent the features of the writing samples in each set of
papers, appear to contribute consistently to the prediction of any of the
scores, such as "number of content words" and "number of spelling
errors." Other, somewhat different features contributed to papers written in
response to different topics, such as "number of long sentences" for
the holistic scores for the Farming topic. Thus these analyses provide a
rough approximation of the most important Writer's Workbench correlates with
Section II of the TOEFL and the writing sample score variables.

The data should be interpreted cautiously because the Writer's Workbench
also is not infallible: it is capable only of doing counts and calculations
based on the tangible characteristics of a paper (e.g., word counts,
readability formulas). Occasionally it is not totally accurate, for example
flagging a spelling error when the word has been correctly spelled. In
such instances, we did not accept the output at face value. The spelling
errors, printed on the output, were carefully checked, and correctly spelled
words were not tallied. However, some of the internal judgments made by the
programs that are not printed cannot be checked. With a recognition of its
limitations, the Writer's Workbench probably can be considered more reliable
than a human judge, particularly in cases where the features are objectively
identifiable and can be counted accurately by a computer program. The CSU
version of the Workbench offers judgmental comments to the writer regarding
features of a paper that generally are not considered "good" writing. One
example is the overuse of "to be" verbs. For our purposes, the counts of "to
be" verbs provide objective data, without attaching judgments. In fact, the
CSU staff noted that the chart/graph topics seemed to elicit more "to be" verb
usage, which was appropriate because other verbs are not as likely to be used
in clearly describing a chart or graph.

VI. SUMMARY OF RESULTS AND CONCLUSIONS



This research generated a considerable amount of information
contributing to the validity of measures of English language proficiency:
writing samples, the TOEFL, and the GRE General Test. A summary of the
major findings follows:

o The two scoring methods for the writing samples, holistic and
discourse-level/sentence-level (D/S), yielded essentially the same
mean levels of performance and were highly correlated, indicating
that the two-score method may not provide any significant
advantage over the one-score method. Aside from the high
correlations among holistic and discourse/sentence scores, we
observed that (1) it was very difficult to select sample papers
for scoring sessions that represented reliably different values of
D and S, and (2) although readers could agree on the levels of
performance for D and S, they perceived the constructs of
discourse-level and sentence-level features to be unclear and
confounded (thus challenging the validity of separating judgments
on the basis of D and S).

o The means of the writing sample scores reflected level
differences for the three language groups for whom English is not
their primary language. For every writing sample score, the means
were lowest for the Arabic sample, in the middle for the Chinese
sample, and highest for the Spanish sample.

o The mean holistic and discourse/sentence scores obtained by the
sample of United States candidates on the writing samples were
considerably higher than the mean scores for the foreign group,
not a surprising result since the focus of the study was on
measures that assess English language proficiency.

o The reliabilities of all the scores assigned to the writing
samples were remarkably high, indicating that the consistent
scoring of writing samples can be achieved (under the optimal
scoring conditions described in previous chapters). The various
types of evidence for reliability of the holistic scores consisted
of interrater reliability, reliability across topics, and
reliability within language groups. For the discourse-level and
sentence-level scores, evidence for reliability consisted of
interrater reliability, reliability across score types and across
topics, reliability within language groups, and reliability across
ESL and English readers.

o Correlations were as high across topic type as within topic type.
This result suggests that (1) the different topics did not elicit
qualitatively different writing performance, and/or (2) the
readers maintained a comparable scale for evaluating the writing
samples, despite performance fluctuations from topic to topic.
These positive results, however, should not be interpreted as
evidence that papers written in response to any topic or type of
topic would yield equivalent reliability. The topics were
selected on the basis of previous research indicating that
specific kinds of topics would serve as more appropriate stimuli
to reflect the academic writing task demands experienced by
students in higher education in the United States. Carefully
controlled conditions of design and pretesting, and of scoring
methods that emphasized functional academic English proficiency,
would need to be replicated to attain similar results.

Both this study and our previous survey of academic writing tasks
have demonstrated, though, that topics designed to elicit the
English skills of TOEFL candidates in different disciplines do not
need to be subject-specific in order to evaluate writing
performance effectively as long as they are within the context of
relevant academic competencies.

o Whatever differences in the perception of good writing may exist
among regular English teachers, ESL teachers, social science
teachers, and engineering teachers, these differences do not
interfere with the ability of these diverse groups to rank
students' writing samples in the same order. When subject-matter
experts in engineering and the social sciences were asked to rate
representative subsamples of papers written in response to two
topics, the professors' ratings were highly correlated with each
other: the mean social science ratings correlated .92 with the
mean engineering ratings for each of the two topics. When
compared with the holistic scores assigned during the regular
scoring session for the compare/contrast topic (Space), the mean
social science judgment correlated .86 with the holistic scores,
and the mean engineering judgment, .92. For the chart/graph topic
(Farming), the correlations were .83 and .82, respectively. This
outcome further supports the assumption that general agreement
exists, even when not formally identified and verbalized,
concerning standards for academic writing competence.

These results also can be explained by two design factors: (1) the
professors were instructed to evaluate the papers from the
perspective of writing competence required of students to succeed
in their graduate-level departments, as opposed to writing
competence in general; and (2) they were supplied with a limited
number and representative sample of papers such that the task was
to some extent more highly structured than the task addressed by
the holistic readers.

o The reader responses to the questionnaires provided information
about the points of view with regard to the evaluation of writing
skills and the readers' exposure to different methods of scoring
papers on the same topics. Reader ratings of the features of
written assignments suggested that the readers perceived that they
were attending to somewhat different characteristics of writing
competence during the holistic scoring than during the
discourse/sentence scoring. However, although the readers may have
focused on different features, the means and standard deviations of
the scores indicated that the different scoring methods did not yield
different score levels. Thus the evaluations of the quality of
writing competence were consistent, regardless of scoring method.
These results suggest that papers that are strong on one measure
(D) are strong on another (S), or that perceptions of D and S go
hand in hand. This finding also supports the supposition held by
readers of compositions that general agreement exists, even when
not formally identified and verbalized, concerning the standards
for writing competence.

Data obtained from the Writer's Workbench, as a tool for
investigating the features of writing samples that may be salient
to readers, suggested that further investigation may provide
useful information regarding relationships among features of the
papers and the scores assigned to the papers.

o In response to other questions on the questionnaire, a
considerable number of readers (70 percent) felt that the scores
they were asked to assign during both scoring sessions were
appropriate to the particular sample of papers.

o Many readers indicated that they would be very uncomfortable
attempting to assign descriptions to score levels because
individual papers at one score level can differ considerably.
Most readers appeared to agree, however, that sample papers at
each score level could be useful and meaningful if provided in a
score manual for an operational writing sample testing program,
both to other readers of writing samples and to those who would
interpret writing sample scores.

o A principal axes factor analysis with varimax rotation of
holistic scores and TOEFL section scores resulted in a two-factor
solution. The two factors appear to be method factors, one
consisting of scores on the three sections of the TOEFL and the
other, of holistic scores on papers written in response to the
four topics. One interpretation of the two factors suggests that
performance on measures of English language proficiency becomes
more differentiated when the measures require a candidate to
respond by applying different cognitive processes (recognition vs.
production).

o A comparison of the relationships of writing sample and TOEFL mean
scores showed that the pattern of means across the three language
groups is highly consistent. This lack of interaction between
type of score (writing sample or multiple-choice) and language
group is consistent with the notion that both types of scores may
assess, to a great extent, the same underlying language
proficiency dimension. However, there is some evidence that the
between-groups differences are smaller for the scores on the
writing samples than for TOEFL scores.

o The correlations between the holistic score total (direct evidence
of a productive skill) and the TOEFL total score (measures of
receptive skills and indirect measures of writing) indicate that
the two measures evaluate English proficiency to a considerable
degree, but that the overlap between the two instruments is not
perfect. The writing sample contributes additional information
regarding English proficiency, since a competently executed
writing sample demonstrates the application of cognitive abilities
far beyond the mastery of mechanics. The TOEFL provides evidence
of mastery of English language skills, but not of higher-order
writing skills such as organization and quality of ideas.

In addition, the relationships of the writing sample score with
other sections of the TOEFL are consistent with the pattern of
relationships among the TOEFL sections, such as reported in
previous research (Pitcher & Ra, 1967; Pike, 1979), although the
sizes of the correlations obtained in this study are somewhat
lower. The earlier research results, however, cannot be compared
directly with our findings because of basic design differences.
In the previous studies, the composition of the TOEFL was
different, since it was the five-section version used prior to
1976. In addition, the topics differed considerably: the
topics in Pike's study included more explicit and restrictive
instructions and elicited papers written in a narrative form.
Pike also investigated three native country groups (from Chile,
Peru, and Japan) whereas this research targeted a different
configuration of native languages (Arabic, Chinese, Spanish). The
consistent pattern of relationships obtained in the three studies,
however, lends further support to the validity of the TOEFL and
direct measures of writing ability.

o For the foreign sample, the correlation of scores on the writing
sample with the TOEFL total scores and with GRE verbal scores is
nearly identical, indicating that the writing sample scores serve
as an indicator of English language skills. For foreign
candidates, however, the GRE verbal measure requires a high level
of English proficiency in contrast to the TOEFL.

o The correlation of writing sample scores with GRE verbal scores is
substantially higher in the total sample than in the international
sample because the United States students scored relatively high
on both measures. The correlations of scores on sections of the
GRE General Test with the TOEFL and writing sample scores present
remarkably stable patterns of relationships.

o When the holistic writing sample scores, averaged over four
topics, were related to scores on item types within the sections
of the GRE General Test, the observed pattern of correlations was
consistent with the relationships reported in other GRE studies.
Specifically, the analytical reasoning and logical reasoning items
were not highly correlated, and the analytical reasoning items
were more highly correlated with the quantitative items than were
the logical reasoning items. On the other hand, the logical
reasoning items were more highly correlated with the verbal items
than were the analytical reasoning items. The holistic scores
were more highly correlated with the logical reasoning items than
with the analytical reasoning items, further indication that the
holistic scores reflect verbal ability as measured by relevant
item types in the GRE General Test.



Conclusions 

The results suggest that, with careful topic selection and adequate
training of raters, writing samples can provide a reliable measure of the
English proficiency of nonnative speakers as well as native speakers of
English, and that direct measures of writing performance, although
substantially correlated with multiple-choice measures such as the TOEFL
and GRE General Test, contribute additional information regarding English
proficiency.

There was no indication of any important differences between the two
topic types (chart/graph interpretation and compare/contrast) used in this
study. However, it is important to remember that both topic types
represent structured, academically oriented writing; results may have been
different with a "What I did last summer" type of topic. Furthermore,
although a single topic type might be all that is needed in an operational
program, that does not imply that a single topic is sufficient. Different
topics, even within the same topic type, elicit slightly different
performances, and the reliability of the total score increases as the
number of topics sampled increases.
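
The relationship between the number of topics and the reliability of the
total score can be sketched with the Spearman-Brown formula. This is offered
as an illustration only: the formula assumes the topics behave as parallel
measures, and the single-topic reliability of .70 below is a hypothetical
value, not a figure reported in this study.

```python
def spearman_brown(rel_single: float, k: float) -> float:
    """Projected reliability of a total based on k parallel parts.

    rel_single : reliability of a score based on one part (one topic)
    k          : number of parts combined into the total
    """
    return k * rel_single / (1.0 + (k - 1.0) * rel_single)


single_topic = 0.70  # hypothetical single-topic reliability
for k in (1, 2, 4):
    print(f"{k} topic(s): projected reliability = {spearman_brown(single_topic, k):.2f}")
```

The gain from each added topic shrinks as the number of topics grows, so the
choice of how many topics to administer is a trade-off between reliability
and testing and scoring cost.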

Separate scores for discourse-level and sentence-level skills do not 
appear to present any advantage over a single holistic score. Computer 
scoring of writing samples (Writer's Workbench) provides data that appear 
to be potentially useful for assisting writing instruction and in the 
development of scoring systems, but it is not a substitute for holistic 
scoring based on human judgments.

Writing performance clearly differs across language groups, just as
TOEFL performance differs across language groups. But there is no evidence
that the writing samples unfairly discriminate against any group. Again,
careful topic selection procedures must be emphasized. Some of the topics
rejected during the pilot testing did indeed appear to be discriminatory.
Further research with criterion scores that were independent of TOEFL
scores would be needed to fully answer any questions of possible bias.

Recommendations



From the standpoint of the TOEFL program, this research contributes
valuable information regarding the potential addition of direct measures of
academic writing ability to the TOEFL. Based on our findings, we recommend
that the decision making regarding this issue take into account the
following considerations:

1. A program of topic design and development such as that used
in this study, and involving pretesting to investigate the
efficacy of new topics and of relationships among
performances on new topics with sections of the TOEFL, should be
implemented. The latter objective could be met by including
topics for pretesting during actual TOEFL administrations at
selected international sites.

If a score for the direct assessment of writing becomes 
operational in a testing program, eventually we would expect to 
observe changes in the size and, possibly, patterns of 
correlations with other sections of the test. The inclusion of a 
writing sample communicates a message about what is valued in the 
assessment of English proficiency. Institutions that prepare 
foreign candidates for admission to postsecondary institutions in 
the United States, as well as the candidates themselves, 
undoubtedly will take steps to meet the challenge of a direct 
measure of writing ability, resulting in observed changes in 
performance on that measure and on other measures of English 
proficiency. 

2. Additional validation research that relates performance on 
writing samples to writing performance in academic settings 
should be conducted. 



3. If direct measures of writing are implemented, one writing 
task is not necessarily a sufficient sample of writing 
performance, since it would not provide assurance that a 
candidate would perform consistently on other writing tasks. The 
results of this study could be interpreted to suggest that 
performance on one writing assignment provided valid and reliable 
information regarding performance on the other tasks; with new 
topics, a different (possibly more or less heterogeneous) 
population, and under slightly different testing conditions, 
however, this finding would need to be demonstrated. Initially, 
a new operational program should determine that performance and 
the evaluation of that performance are consistent from one writing 
assignment to another. Ideally, each candidate should be required 
to respond to more than one writing assignment in the early stages 
of a direct assessment program. 
4. The number of scorers who evaluate a paper presents a significant 
operational cost consideration. At least two readers should be 
used to ensure valid and reliable scores, particularly when those 
scores may be critical to the educational progress of candidates. 
It may be possible that, after accumulating a history of highly 
correlated scores assigned by two readers, the program could 
justify scoring by only one reader. 

5. Meaningful information regarding the appropriate use and 
interpretation of scores on direct measures of writing should be 
provided to those who interpret and use these scores. The 
consensus of the readers involved in the holistic and 
discourse/sentence scoring sessions indicated that the different 
points on a score scale cannot be described in terms of the salient 
features of the papers at each level, since so many different features 
are mentally weighed in the course of making a holistic judgment, and 
these features vary from paper to paper at a particular score 
point. Instead, they recommended a manual that contains several 
benchmark papers at each point along the score scale, with 
descriptive comments accompanying each paper. Such information 
would assist test users in making placement decisions that would 
be appropriate to the candidate and to the institution's specific 
academic requirements. 

These general recommendations represent a variety of specific operational 
issues that will need to be resolved and that do not fall within the domain 
of this study, particularly the criteria for making the decision whether or 
not to include a direct measure of writing ability as a section of the 
TOEFL. Based on the results of this research, either decision could be 
justified. 

From the standpoint of the GRE program, the data have contributed 
valuable information regarding the relationships among GRE General Test 
scores, TOEFL scores, and direct measures of writing ability. These data 
contribute to the interpretation of GRE score data, since considerable 
numbers of GRE candidates are nonnative speakers of English, and writing 
ability is important to the successful performance of both native and 
nonnative speakers in graduate-level academic contexts. 
Table 1 

TOEFL Score Data for Total Sample of International 
Candidates and Three Language Groups 

                                                                     TOEFL
TOEFL Scores                                        Mean      SD    Means*

Total Sample (N = 542) 
  Section I.   Listening Comprehension             51.70    7.38      --
  Section II.  Structure and Written Expression    52.03    7.23
  Section III. Reading Comprehension               52.26    6.68
  Total                                           519.97   64.08

Arabic Language Group (N = 138) 
  Section I.   Listening Comprehension             48.28    8.02      49
  Section II.  Structure and Written Expression    48.11    7.92      45
  Section III. Reading Comprehension               47.37    6.85      45
  Total                                           479.22   67.62     463

Chinese Language Group (N = 230) 
  Section I.   Listening Comprehension             51.99    5.62      50
  Section II.  Structure and Written Expression    52.49    5.75      50
  Section III. Reading Comprehension               52.80    5.33      51
  Total                                           524.26   48.62     503

Spanish Language Group (N = 174) 
  Section I.   Listening Comprehension             54.03    7.90      52
  Section II.  Structure and Written Expression    54.53    7.16      48
  Section III. Reading Comprehension               55.41    5.94      51
  Total                                           546.61   63.46     504

*TOEFL score means for the separate language groups, as reported in 
the TOEFL Test and Score Manual (1983). 
Table 2 

Scores on Writing Samples, TOEFL, GRE General Test, 
and LSAT Writing Test for Sample of GRE Candidates 

Scores                                            Mean      SD      N

Writing Sample Scores (International and U.S. Candidates) 
  Holistic (over four topics)                    14.78    4.97    172
  Discourse-level (over two topics)               7.75    2.44
  Sentence-level (over two topics)                7.36    2.78

TOEFL (International Candidates Only) 
  Section I.   Listening Comprehension           52.81    6.37    124
  Section II.  Structure & Written Expression    54.27    5.44
  Section III. Reading Comprehension             54.46    5.05
  Total                                         538.45   47.51

GRE General Test (International and U.S. Candidates) 
  GRE-Verbal                                    384.59  144.64    172
  GRE-Quantitative                              634.83  114.54
  GRE-Analytical                                487.56  120.43

LSAT Writing Test (U.S. Candidates Only) 
  Usage Section (35 items)                       21.05    6.62     43
  Sentence Correction Section (25 items)         14.72    4.32
  Total (60 items)                               35.77   10.25
Table 3 

Scores on Writing Samples, TOEFL, GRE General Test, and LSAT Writing Test 
for United States and International Samples of GRE Candidates 

                                                  International           United States
Scores                                          Mean      SD     N      Mean      SD     N

Writing Sample Scores 
  Holistic (over four topics)                  12.56    3.30   124     20.53    3.81    48
  Discourse-level (over two topics)             6.70    1.80           10.46    1.67
  Sentence-level (over two topics)              6.05    1.85           10.73    1.72

TOEFL (International Candidates Only) 
  Section I.   Listening Comprehension         52.81    6.37   124
  Section II.  Structure & Written Expression  54.27    5.44
  Section III. Reading Comprehension           54.46    5.05
  Total                                       538.45   47.51

GRE General Test (International and U.S. Candidates) 
  GRE-Verbal                                  320.00   91.61   124    551.46  121.29    48
  GRE-Quantitative                            660.81  101.07          567.71  120.91
  GRE-Analytical                              450.48   98.69          583.33  119.53

LSAT Writing Test (U.S. Candidates Only) 
  Usage Section (35 items)                                             21.05    6.62    43
  Sentence Correction Section (25 items)                               14.72    4.32
  Total (60 items)                                                     35.77   10.25
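
The combined-sample figures in Table 2 can be checked against the subgroup figures in Table 3. The sketch below pools the GRE-Verbal means and standard deviations of the international and U.S. groups. It assumes the reported SDs are sample standard deviations and reads the U.S. GRE-Verbal SD as 121.29; both are assumptions of this illustration, and the function name is hypothetical.

```python
import math

# (n, mean, sd) for GRE-Verbal, as read from Table 3.
# The U.S. SD of 121.29 is an assumed reading of the printed value.
GROUPS = [(124, 320.00, 91.61), (48, 551.46, 121.29)]


def combine(groups):
    """Pool group sizes, means, and sample SDs into overall statistics."""
    n_total = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / n_total
    # Total sum of squares = within-group part + between-group part.
    ss = sum(
        (n - 1) * sd ** 2 + n * (m - grand_mean) ** 2
        for n, m, sd in groups
    )
    return n_total, grand_mean, math.sqrt(ss / (n_total - 1))


n, m, sd = combine(GROUPS)
print(n, round(m, 2), round(sd, 2))
```

The pooled values agree closely with the GRE-Verbal row of Table 2 (N = 172, mean 384.59, SD 144.64).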
Table 4 

Scores on Writing Samples for Total Sample 
and International Language Groups 

                          Intl. Total       Arabic        Chinese        Spanish
Writing Sample Scores     Mean     SD    Mean     SD    Mean     SD    Mean     SD

Holistic Scores 
  Space                   3.07   1.17    2.80   1.31    2.99   1.06    3.39   1.13
  Leisure                 3.22   1.15    2.85   1.22    3.20   1.09    3.55   1.08
  Farming                 3.19   1.17    2.91   1.12    3.02   1.14    3.64   1.12
  Continents              3.14   1.09    2.91   1.15    2.98    .99    3.52   1.06
  Total for four         12.63   3.89   11.47   4.18   12.20   3.58   14.11   3.62

D/S Scores 
  D score - Space         3.29   1.17    2.90   1.17    3.27   1.09    3.62   1.17
  S score - Space         2.89   1.20    2.57   1.28    2.81   1.07    3.26   1.22
  D score - Farming       3.25   1.16    2.68   1.13    3.29   1.09    3.65   1.12
  S score - Farming       2.96   1.17    2.43   1.05    2.95   1.10    3.39   1.16

                           N = 542        N = 138        N = 230        N = 174
Table 5 

Criteria Used to Evaluate Written Assignments 
Saturday and Sunday Reader Questionnaire Responses 

(In percentages of total of 50 respondents on Saturday, 
51 respondents on Sunday) 

                                  Degree of Importance
                                Low     Moderate   High
Features of Written Assignments   1    2    3    4    5 Blank   Mean    SD

 1. Correctness of punctuation/spelling 
     Prior to reading             4   22   40   24    6     4    3.1   .95
     During holistic reading     12   38   38    6    2     4    2.5   .87
     During D/S reading           8   18   29   29   12     4    3.2  1.14

 2. Mastery of the conventions of grammar 
     Prior to reading             0    0   32   42   20     6    3.9   .74
     During holistic reading      0   14   50   28    4     4    3.2   .75
     During D/S reading           0    6   20   39   33     2    4.0   .89

 3. Quality of sentence structure 
     Prior to reading             0    0   18   50   28     4    4.1   .69
     During holistic reading      0    2   38   46   10     4    3.7   .69
     During D/S reading           0    0   12   35   51     2    4.4   .70

 4. Size of vocabulary 
     Prior to reading             2   22   48   20    4     4    3.0   .84
     During holistic reading      4   26   42   20    4     4    3.0   .91
     During D/S reading           2   18   37   35    6     2    3.3   .90

 5. Appropriateness of vocabulary usage 
     Prior to reading             0    2   38   44   10     6    3.7   .70
     During holistic reading      0    8   38   42    8     4    3.5   .77
     During D/S reading           0    4   20   61   14     2    3.9   .70

 6. Quality of paragraph organization 
     Prior to reading             0    4   14   50   28     4    4.1   .78
     During holistic reading      0   16   26   40   14     4    3.5   .94
     During D/S reading           0    0   18   49   29     4    4.1   .70

 7. Quality of overall paper organization 
     Prior to reading             0    4    6   48   38     4    4.2   .76
     During holistic reading      0    8   24   32   32     4    3.9   .96
     During D/S reading           0    0    6   31   61     2    4.6   .61

 8. Quality of content 
     Prior to reading             2    6   14   44   30     4    4.0   .96
     During holistic reading      2   14   32   34   14     4    3.5   .99
     During D/S reading           0   12   26   41   20     2    3.7   .93

 9. Development of ideas 
     Prior to reading             0    0   16   32   48     4    4.3   .75
     During holistic reading      0    4   26   38   28     4    3.9   .86
     During D/S reading           0    4   20   37   37     2    4.1   .86

10. Overall writing ability 
     Prior to reading             2    0    4   34   56     4    4.5   .77
     During holistic reading      0    0    8   24   66     2    4.6   .64
     During D/S reading           0    0    8   29   59     4    4.5   .65

Meeting constraints of particular assignments: 

11. Student addresses topic adequately and directly 
     Prior to reading             0    2    8   54   32     4    4.2   .68
     During holistic reading      4   16   36   28   12     4    3.3  1.00
     During D/S reading           2    6   29   35   26     2    3.8   .98

12. Student adopts a tone, attitude, or style appropriate to the audience 
     Prior to reading             4   10   32   38   12     4    3.5   .99
     During holistic reading     12   30   40   10    4     4    2.6   .98
     During D/S reading           8   22   33   31    4     2    3.0  1.02

13. Student appropriately meets assignment requirements 
     Prior to reading             2    2   18   44   30     4    4.0   .89
     During holistic reading      6   18   32   32    8     4    3.2  1.00
     During D/S reading           1   14   33   43    6     2    3.4   .88
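
The Mean and SD columns of Table 5 can be related to the percentage columns with a short calculation. The sketch below uses the "During holistic reading" row for correctness of punctuation/spelling (12, 38, 38, 6, and 2 percent at scale points 1 through 5, with 4 percent blank). It assumes that blank responses were excluded and that a population standard deviation was used; neither assumption is stated in the report, and the function name is hypothetical.

```python
import math

# Percentages at scale points 1-5 for "Correctness of punctuation/spelling",
# during holistic reading (Table 5); a further 4 percent left the item blank.
PERCENT_BY_POINT = {1: 12, 2: 38, 3: 38, 4: 6, 5: 2}


def mean_and_sd(percent_by_point):
    """Mean and population SD of the ratings, ignoring blank responses."""
    answered = sum(percent_by_point.values())
    mean = sum(p * w for p, w in percent_by_point.items()) / answered
    var = sum(w * (p - mean) ** 2 for p, w in percent_by_point.items()) / answered
    return mean, math.sqrt(var)


mean, sd = mean_and_sd(PERCENT_BY_POINT)
print(round(mean, 1), round(sd, 2))
```

Under these assumptions the result rounds to the tabled values of 2.5 and .87; small differences on other rows may come from rounding of the percentages.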
Table 6 

Criteria Used to Evaluate Written Assignments 
Saturday Reader Questionnaire Responses 
Prior to Reading Sessions 

(In percentages of total of 50 respondents and 
24 ESL Readers, 26 English Readers) 

                                  Degree of Importance
                                Low     Moderate   High
Features of Written Assignments   1    2    3    4    5 Blank   Mean    SD

 1. Correctness of punctuation/spelling 
     All respondents              4   22   40   24    6     4    3.1   .95
     ESL readers                  4   25   29   29    4     8    3.0  1.00
     English readers              4   19   50   19    8     0    3.1   .94

 2. Mastery of the conventions of grammar 
     All respondents              0    0   32   42   20     6    3.9   .74
     ESL readers                  0    0   21   42   29     8    4.1   .75
     English readers              0    0   42   42   12     4    3.7   .69

 3. Quality of sentence structure 
     All respondents              0    0   18   50   28     4    4.1   .69
     ESL readers                  0    0   21   42   29     8    4.1   .75
     English readers              0    0   15   58   27     0    4.1   .65

 4. Size of vocabulary 
     All respondents              2   22   48   20    4     4    3.0   .84
     ESL readers                  4   17   50   21    8     0    3.0   .78
     English readers              0   27        19    8     0    3.1   .89

 5. Appropriateness of vocabulary usage 
     All respondents              0    2   38   44   10     6    3.7   .70
     ESL readers                  0    0   38   54    0     8    3.6   .50
     English readers              0    4   38   35   19     4    3.7   .84

 6. Quality of paragraph organization 
     All respondents              0    4   14   50   28     4    4.1   .78
     ESL readers                  0    4   21   42   25     8    4.0   .84
     English readers              0    4    8   58   31     0    4.2   .73

 7. Quality of overall paper organization 
     All respondents              0    4    6   48   38     4    4.2   .76
     ESL readers                  0    8   12   42   29     8    4.0   .93
     English readers              0    0    0   54   46     0    4.5   .51

 8. Quality of content 
     All respondents              2    6   14   44   30     4    4.0   .96
     ESL readers                  4    8   17   46   17     8    3.7
     English readers              0    4   12   42   42     0    4.2

 9. Development of ideas 
     All respondents              0    0   16   32   48     4    4.3   .75
     ESL readers                  0    0   21   50   21     8    4.0   .69
     English readers              0    0   12   15   73     0    4.6   .70

10. Overall writing ability 
     All respondents              2    0    4   34   56     4    4.5   .77
     ESL readers                  0    0    8   46   38     8    4.3   .65
     English readers              4    0    0   23   73     0    4.6   .85

Meeting constraints of particular assignments: 

11. Student addresses topic adequately and directly 
     All respondents              0    2    8   54   32     4    4.2   .68
     ESL readers                  0    4    8   58   21     8    4.0   .72
     English readers              0    0    9   53   42     0    4.3   .63

12. Student adopts a tone, attitude, or style appropriate to the audience 
     All respondents              4   10   32   38   12     4    3.5   .99
     ESL readers                  4   17   33   29    8     8    3.2  1.02
     English readers              4    4   31   46   15     0    3.7   .94

13. Student appropriately meets assignment requirements 
     All respondents              2    2   18   44   30     4    4.0   .89
     ESL readers                  0    4   21   38   29     8    4.0   .87
     English readers              4    0   15   50   31     0    4.0   .92
Table 7 

Criteria Used to Evaluate Written Assignments 
Saturday Reader Questionnaire Responses 
During Holistic Scoring 

(In percentages of total of 50 respondents and 
24 ESL Readers, 26 English Readers) 

                                  Degree of Importance
                                Low     Moderate   High
Features of Written Assignments   1    2    3    4    5 Blank   Mean    SD

 1. Correctness of punctuation/spelling 
     All respondents             12   38   38    6    2     4    2.5   .87
     ESL readers                 17   42   38    4    0     0    2.2   .81
     English readers              8

 2. Mastery of the conventions of grammar 
     All respondents              0   14   50   28    4     4    3.2   .75
     ESL readers                  0   17   54   29    0     0    3.1   .68
     English readers              0   12   46   27    8     8    3.3   .82

 3. Quality of sentence structure 
     All respondents              0    2   38   46   10     4    3.7   .69
     ESL readers                  0    0   50   42    8     0    3.6   .65
     English readers              0    4   27   50   12     8    3.8   .74

 4. Size of vocabulary 
     All respondents              4   26   42   20    4     4    3.0   .91
     ESL readers                  4   17   50   29    0     0    3.0   .81
     English readers              4   35   35   12    8     8    2.8  1.00

 5. Appropriateness of vocabulary usage 
     All respondents              0    8   38   42    8     4    3.5   .77
     ESL readers                  0   12   46   38    4     0    3.3   .76
     English readers              0    4   31   46   12     8    3.7   .75

 6. Quality of paragraph organization 
     All respondents              0   16   26   40   14     4    3.5   .94
     ESL readers                  0   17   33   38   12     0    3.5   .93
     English readers              0   15   19   42   15     8    3.6   .97

 7. Quality of overall paper organization 
     All respondents              0    8   24   32   32     4    3.9   .96
     ESL readers                  0   12        21   33     0    3.8  1.07
     English readers              0    4   15   42   31     8    4.1   .83

 8. Quality of content 
     All respondents              2   14   32   34   14     4    3.5   .99
     ESL readers                  4    8   46   29   12     0    3.4   .97
     English readers              0   19   19   38   15     8    3.5  1.00

 9. Development of ideas 
     All respondents              0    4   26   38   28     4    3.9   .86
     ESL readers                  0    4   33   46   17     0    3.8   .79
     English readers              0    4   19   31   38     8    4.1   .90

10. Overall writing ability 
     All respondents              0    0    8   24   66     2    4.6   .64
     ESL readers                  0    0   17   25   54     4    4.4   .78
     English readers              0    0    0   23   77     0    4.8   .43

Meeting constraints of particular assignments: 

11. Student addresses topic adequately and directly 
     All respondents              4   16   36   28   12     4    3.3  1.00
     ESL readers                  4   21   42   21   12     0    3.2  1.00
     English readers              4   12   31   35   12     8    3.4  1.00

12. Student adopts a tone, attitude, or style appropriate to the audience 
     All respondents             12   30   40   10    4     4    2.6   .98
     ESL readers                 17   29   46    4    4     0    2.5   .98
     English readers              8   31   35   15    4     8    2.8   .99

13. Student appropriately meets assignment requirements 
     All respondents              6   18   32   32    8     4    3.2  1.00
     ESL readers                  8   21   25   38    8     0    3.2  1.13
     English readers              4   15   38   27    8     8    3.2   .98
Table 8 

Criteria Used to Evaluate Written Assignments 
Sunday Reader Questionnaire Responses 
During Discourse/Sentence Scoring 

(In percentages of total of 51 respondents and 
24 ESL Readers, 27 English Readers) 

                                  Degree of Importance
                                Low     Moderate   High
Features of Written Assignments   1    2    3    4    5 Blank   Mean    SD

 1. Correctness of punctuation/spelling 
     All respondents              8   18   29   29   12     4    3.2  1.14
     ESL readers                 12   25   25   21   12     4    3.0
     English readers              4   11   33   37   11     4    3.4   .99

 2. Mastery of the conventions of grammar 
     All respondents              0    6   20   39   33     2    4.0   .89
     ESL readers                  0    8    8   46   33     4    4.1   .90
     English readers              0    4   30   33   33     0    4.0   .90

 3. Quality of sentence structure 
     All respondents              0    0   12   35   51     2    4.4   .70
     ESL readers                  0    0    8   33   54     4    3.3   .92
     English readers              0    0   15   37   48     0    4.3   .73

 4. Size of vocabulary 
     All respondents              2   18   37   35    6     2    3.3   .90
     ESL readers                  4   12   38   38    4     4    3.3   .92
     English readers              0   22   37   33          0    3.3   .90

 5. Appropriateness of vocabulary usage 
     All respondents              0    4   20   61   14     2    3.9   .70
     ESL readers                  0    8   25   50   12     4    3.7   .82
     English readers              0    0   15   70   15     0    4.0   .56

 6. Quality of paragraph organization 
     All respondents              0    0   18   49   29     4    4.1   .70
     ESL readers                  0    0   21   38   33     8    4.1   .77
     English readers              0    0   15   59   26     0    4.1   .64

 7. Quality of overall paper organization 
     All respondents              0    0    6   31   61     2    4.6   .61
     ESL readers                  0    0    8   25   62     4    4.6   .66
     English readers              0    0    4   37   59     0    4.6   .58

 8. Quality of content 
     All respondents              0   12   26   41   20     2    3.7   .93
     ESL readers                  0   17   33   38    8     4    3.4   .89
     English readers              0    7   18

 9. Development of ideas 
     All respondents              0    4   20   37   37     2    4.1   .86
     ESL readers                  0    4   33
     English readers              0    4    7   33   56     0    4.4   .80

10. Overall writing ability 
     All respondents              0    0    8   29   59     4    4.5   .65
     ESL readers                  0    0   12   33   46     8    4.4   .73
     English readers              0    0    4   26   70     0    4.7   .56

Meeting constraints of particular assignments: 

11. Student addresses topic adequately and directly 
     All respondents              2    6   29   35   26     2    3.8   .98
     ESL readers                  4    8   29   33   21     4    3.6  1.08
     English readers              0    4   30   37   30     0    3.9   .87

12. Student adopts a tone, attitude, or style appropriate to the audience 
     All respondents              8   22   33   31    4     2    3.0  1.02
     ESL readers                 12   25   33   21    4     4    2.8  1.08
     English readers              4   18   33   41    4     0    3.2   .93

13. Student appropriately meets assignment requirements 
     All respondents              1   14   33   43    6     2    3.4   .88
     ESL readers                  4   21   2?   38    4     4    3.2   .98
     English readers              0    7   37   48    7     0    3.6   .75
Table 9 

Reader Responses to Questions About Scoring 
Systems on Saturday and Sunday Questionnaires 

(In whole percentages of total of 50 respondents 
on Saturday, 51 respondents on Sunday) 

                                                         Responses
Questions                                     Yes    No  Maybe  Blank   Mean    SD

1. Is this kind of scoring appropriate 
   to and useful in the classroom? 
     Saturday                                  70    16     12      2    1.4   .70
     Sunday                                    57    33      4      6    1.4   .58

3. Do you feel that the scores you were 
   asked to give were appropriate for 
   the papers you read in this session? 
     Saturday                                  82     8      8      2    1.2   .60
     Sunday                                    80    12      2      6    1.2   .43

4. After this reading experience, do you 
   feel that it is possible to make clear 
   distinctions between papers at adjacent 
   score intervals? 
     Saturday                                  60    18      8     14    1.4   .66
     Sunday                                    45    35     18      2    1.7   .76

5. Do you feel that it would be possible 
   to assign descriptions to each of the 
   score intervals used...? 
     Saturday                                  50    26     10     14    1.5   .70
     Sunday                                    51    37     10      2    1.6   .67

Questions only on Sunday questionnaire: 

6. Are the kinds of scores we asked you 
   to assign appropriate to the papers 
   that were read? 
     The holistic judgments?                   82    10      2      6    1.1   .41
     The two-score judgments?                  74    16      4      6    1.2   .53

7. Regarding the two-score judgments, 
   did you feel that they were 
     Independent?                              51    33     10      6    1.6   .68
     Pertinent?                                82     4      4     10    1.1   .45
     All-inclusive?                            63    28      2      8    1.3   .52
     Should have been divided differently?      4    78      4     14    2.0   .30
Table 10 

ESL and English Reader Responses to Questions 
About Scoring Systems on Saturday Questionnaire 

(In whole percentages of total of 50 respondents; 
24 ESL, 26 English readers) 

                                            Responses
Questions                        Yes    No  Maybe  Blank   Mean    SD

1. Is this kind of scoring appropriate 
   to and useful in the classroom? 
     All respondents              70    16     12      2    1.4   .70
     ESL readers                  58    21     17      4    1.6   .79
     English readers              81    12      8      0    1.3   .60

3. Do you feel that the scores you were 
   asked to give were appropriate for 
   the papers you read in this session? 
     All respondents              82     8      8      2    1.2   .60
     ESL readers                  71     8     17      4    1.4   .79
     English readers              92     8      0      0    1.1   .27

4. After this reading experience, do you 
   feel that it is possible to make clear 
   distinctions between papers at adjacent 
   score intervals? 
     All respondents              60    18      8     14    1.4   .66
     ESL readers                  54    21     12     12    1.5   .75
     English readers              65    15      4     15    1.3   .55

5. Do you feel that it would be possible 
   to assign descriptions to each of the 
   score intervals used...? 
     All respondents              50    26     10     14    1.5   .70
     ESL readers                  67    12      8     12    1.3   .66
     English readers              35    39     12     15    1.7   .70
Table 11 

ESL and English Reader Responses to Questions 
About Scoring Systems on Sunday Questionnaires 

(In whole percentages of total of 51 respondents; 
24 ESL, 27 English readers) 

                                            Responses
Questions                        Yes    No  Maybe  Blank   Mean    SD

1. Is this kind of scoring appropriate 
   to and useful in the classroom? 
     All respondents              57    33      4      6    1.4   .58
     ESL readers                  54    33      4      8    1.5   .60
     English readers              59    33      4      4    1.4   .58

3. Do you feel that the scores you were 
   asked to give were appropriate for 
   the papers you read in this session? 
     All respondents              80    12      2      6    1.2   .43
     ESL readers                  79    12      0      8    1.2   .35
     English readers              82    11      4      4    1.2   .49

4. After this reading experience, do you 
   feel that it is possible to make clear 
   distinctions between papers at adjacent 
   score intervals? 
     All respondents              45    35     18      2    1.7   .76
     ESL readers                  46    29     21      4    1.7   .81
     English readers              44    41     15      0    1.7   .72

5. Do you feel that it would be possible 
   to assign descriptions to each of the 
   score intervals used...? 
     All respondents              51    37     10      2    1.6   .67
     ESL readers                  50    33     12      4    1.6   .72
     English readers              52    41      7      0    1.6   .64

Questions only on Sunday questionnaire: 

6. Are the kinds of scores we asked you 
   to assign appropriate to the papers 
   that were read? 

   The holistic judgments? 
     All respondents              82    10      2      6    1.1   .41
     ESL readers                  71    17      4      8    1.3   .55
     English readers              93     4      0      4    1.0   .20

   The two-score judgments? 
     All respondents              74    16      4      6    1.2   .53
     ESL readers                  71    17      4      8   1.27   .55
     English readers              78    15      4      4   1.23   .51

7. Regarding the two-score judgments, 
   did you feel that they were 

   Independent? 
     All respondents              51    33     10      6    1.6   .68
     ESL readers                  38    42     17      4    1.8   .74
     English readers              63    26      4      7    1.4   .57

   Pertinent? 
     All respondents              82     4      4     10    1.1   .45
     ESL readers                  83     0      8      8    1.2   .59
     English readers              82     8      0     11    1.1   .28

   All-inclusive? 
     All respondents              63    28      2      8    1.3   .52
     ESL readers                  62    25      4      8    1.4   .58
     English readers              63    30      0      7    1.3   .48

   Should have been divided differently? 
     All respondents               4    78      4     14    2.0   .30
     ESL readers                   4    75      8     12    2.0   .38
     English readers               4    82      0     15          .21
Table 12 

Factor Loadings Obtained from the Principal Axes Factor Analysis 
Seven Writing Sample and TOEFL Variables 

(N = 560) 

                                                 Factor I   Factor II
Variables                                        Loading    Loading

Writing Samples 
  Holistic score - Space                           .80        .32
  Holistic score - Leisure                         .78        .33
  Holistic score - Farming                         .80        .32
  Holistic score - Continents                      .75        .34

TOEFL 
  Section I.   Listening Comprehension             .26        .87
  Section II.  Structure and Written Expression    .43        .79
  Section III. Reading Comprehension               .41        .82
Table 13 

Factor Loadings Obtained from the Principal Axes Factor Analysis 
Seven Writing Sample and TOEFL Variables 

Arabic language group (N = 139) 

                                                 Factor I   Factor II
Variables                                        Loading    Loading

Writing Samples 
  Holistic score - Space                           .79        .37
  Holistic score - Leisure                         .78        .42
  Holistic score - Farming                         .84        .25
  Holistic score - Continents                      .82        .23

TOEFL 
  Section I.   Listening Comprehension             .19        .93
  Section II.  Structure and Written Expression    .58        .66
  Section III. Reading Comprehension               .60        .69

(Accounting for 79% of total variance) 
Table 14 

Factor Loadings Obtained from the Principal Axes Factor Analysis 
Seven Writing Sample and TOEFL Variables 

Chinese language group (N = 230) 

                                                 Factor I   Factor II
Variables                                        Loading    Loading

Writing Samples 
  Holistic score - Space                           .82        .24
  Holistic score - Leisure                         .81        .24
  Holistic score - Farming                         .82        .26
  Holistic score - Continents                      .72        .33

TOEFL 
  Section I.   Listening Comprehension             .20
  Section II.  Structure and Written Expression    .32        .80
  Section III. Reading Comprehension               .33        .84

(Accounting for 73% of total variance) 
Table 15 

Factor Loadings Obtained from the Principal Axes Factor Analysis 
Seven Writing Sample and TOEFL Variables 

Spanish language group (N = 191) 

                                                 Factor I   Factor II
Variables                                        Loading    Loading

Writing Samples 
  Holistic score - Space                           .37        .74
  Holistic score - Leisure                         .22        .83
  Holistic score - Farming                         .50        .64
  Holistic score - Continents                      .43        .68

TOEFL 
  Section I.   Listening Comprehension             .82        .32
  Section II.  Structure and Written Expression    .80        .43
  Section III. Reading Comprehension               .86        .32

(Accounting for 74% of total variance) 
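
The "total variance" figure under a loading table can be related to the loadings themselves: with standardized variables and uncorrelated factors, the share of variance accounted for is the sum of squared loadings divided by the number of variables. The sketch below applies this to the Spanish language group loadings in Table 15. It assumes orthogonal factors (not stated in the table) and reads the printed commas in two loadings as decimal points; the names are hypothetical.

```python
# Loadings from Table 15 (Spanish language group): (Factor I, Factor II).
LOADINGS = {
    "Holistic - Space": (0.37, 0.74),
    "Holistic - Leisure": (0.22, 0.83),
    "Holistic - Farming": (0.50, 0.64),
    "Holistic - Continents": (0.43, 0.68),
    "TOEFL I Listening": (0.82, 0.32),
    "TOEFL II Structure/Written Expr.": (0.80, 0.43),
    "TOEFL III Reading": (0.86, 0.32),
}


def variance_shares(loadings):
    """Share of total variance per factor: sum of squared loadings / variables."""
    n_vars = len(loadings)
    n_factors = len(next(iter(loadings.values())))
    return [
        sum(row[j] ** 2 for row in loadings.values()) / n_vars
        for j in range(n_factors)
    ]


# Communality of each variable: variance explained by both factors together.
communalities = {name: sum(x ** 2 for x in row) for name, row in LOADINGS.items()}

shares = variance_shares(LOADINGS)
total_share = sum(shares)
print([round(s, 3) for s in shares], round(100 * total_share))
```

Under these assumptions the two factors together account for about 74 percent of the total variance, in line with the note under Table 15.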
Table 16 

Correlations of Holistic Scores, D/S Scores, 
and TOEFL Scores 
(total sample of 542 candidates) 

                              Holistic Scores            D/S Scores          TOEFL Scores
                           C/C         C/G             Space     Farming
                            S     L     F     C     T     D     S     D     S     I    II   III

Holistic Compare/Contrast 
  Space (S) 
  Leisure (L)             .65

Holistic Chart/Graph 
  Farming (F)             .65   .66
  Continents (C)          .62   .60   .61

Total holistic (T)        .86   .85   .86   .82

Discourse/Sentence 
  Space - D               .74   .?2   .64   .58   .76
  Space - S               .72   .61   .61   .56   .74   .83
  Farming - D             .58   .59   .72   .56   .72   .59   .52
  Farming - S             .6?   .?5   .72   .61   .78   .63   .63   .84

TOEFL 
  I.   Listening C.       .50   .53   .50   .49   .59   .52   .??   .53   .56
  II.  S & W Express.     .59   .57   .60   .58   .69   .60   .60   .58   .61   .68
  III. Reading C.         .60   .58   .58   .58   .69   .60   .58   .62   .63   .72   .79
  Total                   .62   .62   .62   .61   .72   .63   .62   .63   .66   .89   .91
Table 17

[Means and Standard Deviations] for Writing Sample and TOEFL Scores

Holistic Scores (four topics and Total; topic column labels illegible in source)

Group      N            (1)     (2)     (3)     (4)    Total
Arabic    174    M     2.80    2.85    2.91    2.91    11.47
                 SD    1.31    1.22    1.15    1.12     4.14
Chinese   230    M     2.99    3.20    2.98    3.02    12.20
                 SD    1.06    1.09     .99    1.14     3.58
Spanish   138    M     3.39    3.55    3.52    3.64    14.11
                 SD    1.13    1.08    1.06    1.12     3.62
Total     542    M     3.07    3.22    3.14    3.19    12.63
                 SD    1.17    1.15    1.09    1.17     3.89

Discourse and Sentence Level Scores
(column header in source reads: Space, Farming, Space, [illegible])

Group      N            (1)     (2)     (3)     (4)
Arabic    174    M     2.90    2.68    2.57    2.43
                 SD    1.17    1.13    1.2?    1.06
Chinese   230    M     3.27    3.29    2.81    2.95
                 SD    1.09    1.09    1.07    1.10
Spanish   138    M     3.62    3.65    3.26    3.39
                 SD    1.17    1.12    1.22    1.16
Total     542    M     3.29    3.25    2.82    2.96
                 SD    1.17    1.16    1.20    1.17

TOEFL Scores

                        I. Listening    II. Structure and     III. Reading
Group      N            Comprehension   Written Expression    Comprehension    Total
Arabic    174    M         48.28             48.11                47.37        479.22
                 SD         8.??              7.92                 6.85         67.62
Chinese   230    M         51.99             52.49                52.80        524.26
                 SD         5.62              5.75                 5.33         48.62
Spanish   138    M         54.03             54.53                55.41        [illegible]
                 SD         7.90              7.16                 5.94         63.46
Total     542    M         51.70             52.03                52.26        519.97
                 SD         7.38              7.23                 6.68         64.08

Note: ? marks a digit that is illegible in the source.
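
The TOEFL Total means in the preceding table are consistent with the usual paper-based TOEFL convention, in which the total score is the sum of the three section scores multiplied by 10/3. The sketch below is offered only as a consistency check under that assumption; the report itself does not state the formula, and the function name is hypothetical. Because a mean is linear, applying the formula to the section means should reproduce the mean of the totals up to rounding.

```python
def toefl_total(listening, structure, reading):
    """Assumed paper-based TOEFL total: sum of section scores times 10/3."""
    return (listening + structure + reading) * 10.0 / 3.0


# Section means for the total sample of 542 candidates, as printed in the table.
total_sample = toefl_total(51.70, 52.03, 52.26)
print(round(total_sample, 2))
```

Applied to the total-sample section means (51.70, 52.03, 52.26), the result rounds to 519.97, the Total mean printed in the table.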



Table 18 

Correlations of Demographic Variables with Holistic Scores,
D/S Scores, and TOEFL Scores*

(total sample of 542 international candidates)

                                          r
Age
  Holistic score - Farming              -.15
  Holistic score - Total                -.15
  Discourse score - Farming             -.18
  TOEFL - Section I (LC)                -.25
  TOEFL - Section II (S & WE)           -.12
  TOEFL - Section III (RC)              -.08 (.05)
  TOEFL - Total                         -.17

Sex
  TOEFL - Section I (LC)                 .15

Number Years of English
  Holistic score - Space                 .1?
  Holistic score - Leisure               .15
  Holistic score - Farming               .11 (.05)
  Holistic score - Continents            .13
  Holistic score - Total                 .15
  Sentence score - Farming               .16
  Discourse score - Farming              .13
  Sentence score - Space                 .13
  TOEFL - Section I (LC)                 .20
  TOEFL - Section II (S & WE)            .12
  TOEFL - Section III (RC)               .11 (.05)
  TOEFL - Total                          .16

*Significant at the .01 level, unless otherwise specified
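
For readers who want to gauge how large a correlation must be to reach the .05 or .01 level in a sample of this size, the following sketch shows one common calculation. It is an illustration under stated assumptions, not the procedure used in the report, which does not say which test, which N per coefficient, or whether one- or two-tailed levels were used. The sketch converts r to a t statistic with N - 2 degrees of freedom and, because N is large, approximates the two-tailed p-value with the standard normal distribution. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def r_to_t(r, n):
    """t statistic for testing a Pearson r against zero (df = n - 2)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def approx_two_tailed_p(r, n):
    """Two-tailed p-value using a normal approximation to the t distribution.

    The approximation is reasonable only for large n (here n = 542).
    """
    t = r_to_t(r, n)
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


for r in (0.11, 0.12, 0.20):
    print(r, round(r_to_t(r, 542), 2), round(approx_two_tailed_p(r, 542), 4))
```

Under these assumptions, with N = 542 a coefficient of .11 falls between the .05 and .01 levels and a coefficient of .12 passes the .01 level.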
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Table 19 



Correlations of Demographic Variables with Holistic Scores,
D/S Scores, and TOEFL Scores*

(sample of 138 Arabic language candidates)

                                          r
Sex
  Holistic score - Leisure               .27
  Holistic score - Farming               .21 (.05)
  Holistic score - Total                 .24
  Sentence score - Farming               .18 (.05)
  TOEFL - Section I (LC)                 .20 (.05)
  TOEFL - Section II (S & WE)            .21 (.05)
  TOEFL - Total                          .20 (.05)

Major Field - Science
  Holistic score - Continents            .29
  TOEFL - Section III (RC)               .30

Number Years of English
  Holistic score - Space                 .29
  Holistic score - Leisure               .25
  Holistic score - Farming               .27
  Holistic score - Continents            .30
  Holistic score - Total                 .32
  Discourse score - Farming              .34
  Sentence score - Farming               .38
  Discourse score - Space                .36
  Sentence score - Space                 .23
  TOEFL - Section I (LC)                 .26
  TOEFL - Section II (S & WE)            .31
  TOEFL - Section III (RC)               .35
  TOEFL - Total                          .35

Undergraduate Level
  Holistic score - Leisure              -.19 (.05)
  Holistic score - Continents           -.22 (.05)
  Holistic score - Total                -.21 (.05)
  TOEFL - Section II (S & WE)           -.19 (.05)
  TOEFL - Section III (RC)              -.25

Age
  TOEFL - Section I (LC)                -.32

*Significant at the .01 level unless otherwise specified
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Table 20 



Correlations of Demographic Variables with Holistic Scores,
D/S Scores, and TOEFL Scores*

(sample of 230 Chinese language candidates)

                                          r
Age
  Holistic score - Space                -.22
  Holistic score - Leisure              -.25
  Holistic score - Farming              -.24
  Holistic score - Continents           -.24
  Holistic score - Total                -.29
  Discourse score - Farming             -.25
  Sentence score - Farming              -.16 (.05)
  Sentence score - Space                -.17
  TOEFL - Section I (LC)                -.20
  TOEFL - Section II (S & WE)           -.25
  TOEFL - Section III (RC)              -.19
  TOEFL - Total                         -.24

Number Years of English
  Holistic score - Space                 .22
  Holistic score - Leisure               .28
  Holistic score - Farming               .19
  Holistic score - Continents            .26
  Holistic score - Total                 .28
  Discourse score - Farming              .17
  Sentence score - Farming               .25
  Discourse score - Space                .15 (.05)
  TOEFL - Section I (LC)                 .18
  TOEFL - Section II (S & WE)            .11 (.05)
  TOEFL - Section III (RC)               .22
  TOEFL - Total                          .20

Undergraduate Level
  Holistic score - Space                 .23
  Holistic score - Leisure               .25
  Holistic score - Farming               .22
  Holistic score - Total                 .26
  Discourse score - Farming              .23
  Sentence score - Farming               .26
  Discourse score - Space                .18 (.05)
  Sentence score - Space                 .14 (.05)

Major Field - Science
  Holistic score - Space                -.22
  Holistic score - Leisure              -.17
  Holistic score - Total                -.18

*Significant at the .01 level unless otherwise specified
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Table 21 



Correlations of Demographic Variables with Holistic Scores,
D/S Scores, and TOEFL Scores*

(sample of 174 Spanish language candidates)

                                          r
Major Field - Science
  Sentence score - Farming               .19 (.05)
  Discourse score - Farming              .18 (.05)
  TOEFL - Section II (S & WE)            .28
  TOEFL - Section III (RC)               .22

Major Field - Business
  Sentence score - Farming              -.18 (.05)
  TOEFL - Section III (RC)              -.17 (.05)

Age
  TOEFL - Section I (LC)                -.24

Number Years of English
  Holistic score - Space                 .30
  Holistic score - Leisure               .33
  Holistic score - Farming               .39
  Holistic score - Continents            .33
  Holistic score - Total                 .41
  Sentence score - Farming               .38
  Discourse score - Farming              .35
  Sentence score - Space                 .30
  Discourse score - Space                .35
  TOEFL - Section I (LC)                 .55
  TOEFL - Section II (S & WE)            .39
  TOEFL - Section III (RC)               .41
  TOEFL - Total                          .50

Undergraduate Level
  TOEFL - Section II (S & WE)           -.27

*Significant at the .01 level unless otherwise specified
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RIC 



Table 22

Correlations of Holistic Scores, D/S Scores,
TOEFL Scores, [illegible] Scores, and GRE Scores



IbtaL 

Ifaliatic D/S&uces IHLSauBBS ISaSaxm OCSoxes 
Sbuce Mac Sat I n m Tbtal U SC Tbtal V g. 



IbCaL Diaxme .89 (.74)* 

(N-172) 

HjtaL Sentaiffi .90 (.78) .91 (.81) 

mm 

lEBPLdf-lZ^) 

I. IC (.53) (.m8) (.52) 



n. s&VE 


(.58) 




(.50) 


(.57) 


(.50) 
















m. (c 


(.52) 




(.54) 


(.53) 


(.62) 


(.60) 














Tbtal 


(.64) 




(.60) 


(.64) 


(.86) 


(.82) 


(.86) 












LSffWrttJi? (IW3) 




























.as 




.45 


.42 


















Sot. Gbrtect. 


.51 




.44 


.49 










.75 








IbCaL 






.48 


.48 


















(JE (W-172) 




























.81 


(.eo) 


.79 (.56) 


.80 (.56) 


.51 


.61 


.72 


.72 


.75 


.66 


.76 




Qjantitative 


-.22 


(-.15) 


-.20 (-.01) 


-.24 (-.06) 


.04 


.05 


.12 


.GB 


.18 




.24 


-.17 


Analytical 


.55 


(.22) 


.55 (.31) 


.52 (.18) 


.31 


.38 


.40 


.42 


.38 


.36 


.40 


.62 .33 



OS 

i 



*Scores in parentheses are for sample of foreign candidates only
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Table 23 

Correlations of Holistic Scores Total* and GRE Item Type Scores

(sample of 132 cases)

                                              GRE Scores
                                     Verbal           Quantitative     Analytical
Scores                       Hol.   SC    DV    RC    QC    M     DI    AR

GRE Verbal
  Sentence Completion (SC)    .68
  Discrete Verbal (DV)        .67   .64
  Reading Comprehension (RC)  .70   .70   .64

GRE Quantitative
  Quantitative Comparisons
    (QC)                     -.22  -.26  -.30  -.12
  Discrete Math (M)          -.31  -.28  -.36  -.26   .76
  Data Interpretation (DI)   -.09  -.03  -.08   .00   .64   .59

GRE Analytical
  Analytical Reasoning (AR)   .23   .15   .17   .24   .46   .35   .50
  Logical Reasoning (LR)      .64   .65   .50   .67  -.09  -.18   .02   .24

*Holistic scores averaged over four writing samples
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Table 24 

GRE General Test Item Types Stepwise Regression Analysis
for Holistic Score Total
(N = 132)

                                        Standardized
                                        Regression
GRE Item Type Predictors          r     Weight         F       R

Reading Comprehension            .70      .24         .63     .80
Discrete Verbal                  .67      .25         .54
Logical Reasoning                .64      .20         .52
Sentence Completion              .68      .19         .62
Data Interpretation             -.09     -.09         .50
Analytical Reasoning             .23      .12         .39
Mathematics                     -.31     -.05         .63
Quantitative Comparisons        -.22     -.01         .68
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Table 25 

Significant Correlations of TOEFL Section I (Listening Comprehension) 
with Writer's Workbench Variables, TOEFL Section Scores, and 
Writing Sample Scores for Four Writing Sample Topics* 



                          Farming    Space    Continents    Leisure
Writer's Workbench          C/G       C/C        C/G          C/C
Text Features             (N = 87)  (N = 81)   (N = 82)     (N = 81)



Quality of development 


• 22 












• 22 * 


Number of spelling errors 


-.23 








*• 21 


W 




Number to check


• 24 




• 30 




• 31 




• 32 ** 


Variation 










-• 25 


A 




Number of short sentences 


.Zl 




• 34 


- 

WW 


• 34 


** 




Number of long sentences 


• 20 




• 30 










Percentage of simple sentences 










-• 19 


* 




Percentage of complex sentences 






• 26 


WW 


. 21 


* 




Percentage of "to be" verbs


• 38 


** 


.22 


* 








Percentage of passives 


• 32 


** 


.27 


** 


.22 


* 


• 26 * 


Percentage of nominalizations


->21 


* 


















.31 


** 


.28 


** 


• 34 ** 


Number of words 


• 24 


* 


.45 


*** 


.43 


*** 


•46 *** 


Average word length 






.26 


** 


.30 


** 




Number of questions 










-.20 


* 




Number of content words 


• 20 




.46 


AAA 


.42 


*** 


• 46 


Percentage of content words 


-•27 


** 












Average length of content words 






.28 


* 


.20 


* 




Percentage of prepositions 


• 19 


* 












Percentage of conjunctions 


-•24 


* 












Percentage of adverbs 










.21 


* 


• 31 ** 


Percentage of nouns 


-•30 


** 








-•22 * 


Kincaid readability


-•19 


* 












Coleman-Liau readability






.26 


** 


.32 


** 




Flesch readability 


-.21 


* 












Percentage of abstract words 






.31 


** 








TOEFL Section II (S & WE)


• 75 


*** 




*** 


.77 


*** 


•80 *** 


TOEFL Section III (RC)


• 72 


*** 


. :j 


*** 


.77 


*** 


•79 *** 


Holistic score for topic 


• 57 


*** 


.62 


*** 


.62 




•65 *** 


Sentence score for topic 


• 62 


*** 


.61 


*** 


(no 


score) 


(no score) 


Discourse score for topic 


• 62 


*** 


.66 


*** 


(no 


score) 


(no score) 



*Levels of significance indicated by asterisks: * = .05, ** = .01, *** = .001
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Table 26 

Significant Correlations of TOEFL Section II (Structure and Written Expression)
with Writer's Workbench Variables, TOEFL Section Scores, and
Writing Sample Scores for Four Writing Sample Topics*

                          Farming    Space    Continents    Leisure
Writer's Workbench          C/G       C/C        C/G          C/C
Text Features             (N = 87)  (N = 81)   (N = 82)     (N = 81)

Quality of development
Number of spelling errors
Percentage of vague words
Number to check
Variation
Average sentence length
Number of short sentences
Number of long sentences
Percentage of "to be" verbs
Percentage of passives
Percentage of nominalizations
Number of sentences
Number of words
Average word length
Number of questions
Number of imperatives
Number of content words
Percentage of content words
Average length of content words
Percentage of prepositions
Percentage of conjunctions
Percentage of adverbs
Percentage of nouns
Percentage of pronouns
Kincaid readability
Auto readability
Coleman-Liau readability
Flesch readability
Percentage of abstract words

TOEFL Section I (LC)
TOEFL Section III (RC)

Holistic score for topic
Sentence score for topic
Discourse score for topic







"• Zi 






.26 




—•Jo 




97 
• LI 




-•36 -^29 








— • JU 










.30 


WW 


9Q 

• 2o 


WW 


• Zo 


9fl 

• Zo 


itit 




9C 

-•25 




— •*/ 






-•26 


WW 


-•26 


WW 








• 27 




• 32 


WW 








mil 


WW 


• Jo 


WW 








•46 




• 20 


W 








• 30 


itit 


• 40 


AAA 




• 27 


WW 










• Z4 * 


9A 




• 23 


* 


• 37 


** 


.24 * 


.35 


** 


• 27 


** 


• 39 


** 


.36 ** 


.36 


** 


• 19 


* 


• 38 


WW 


• H«* 


9Q 

• 2o 


WW 


• 24 


* 




















— • £0 






• 24 




• 42 


www 


.37 ** 


•40 


ititit 


-•21 




.39 


WW 








.34 


WW 






.34 ** 


• 31 




.26 


** 












-.31 


** 




















.21 * 


• 24 


* 


-.19 


* 






















-•24 


* 


-.30 


** 


-.23 


* 








-.27 


** 


-.23 


* 




-•19 


* 


-.22 


* 


.33 


** 


,46 *** 


• 27 


** 












• 22 


* 


.23 


* 


.24 


* 


-.18 * 






.75 


*** 


.67 


*** 


.77 *** 


• 80 


*** 


.84 


*** 


.83 


*** 


.79 ***• 


• 86 




.67 


*** 


.62 


*** 


.69 *** 


• 64 


*** 


.70 


*** 


.68 


*** 


(no score) 


(no 


score) 


.65 


*** 


.72 


*** 


(no score) 


(no 


score) 



*Levels of significance indicated by asterisks: * = .05, ** = .01, *** = .001
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Table 27 

Significant Correlations of TOEFL Section III (Reading Comprehension)
with Writer's Workbench Variables, TOEFL Section Scores, and
Writing Sample Scores for Four Writing Sample Topics*

                          Farming    Space    Continents    Leisure
Writer's Workbench          C/G       C/C        C/G          C/C
Text Features                                               (N = 81)

Quality of development
Number of suggestions to substitute
Number of spelling errors
Number to check
Number of punctuation errors
Variation
Average sentence length
Number of short sentences
Number of long sentences
Percentage of "to be" verbs
Percentage of passives
Percentage of nominalizations
Number of sentences
Number of words
Average word length
Number of questions
Number of imperatives
Number of content words
Percentage of content words
Average length of content words
Percentage of prepositions
Percentage of conjunctions
Percentage of adverbs
Percentage of nouns
Percentage of pronouns
Kincaid readability
Auto readability







-.23 


* 






7\ 

9 CL 




-.21 


* 


















9 Cf 


** 


. 40 










** 




** 


. JO 












-.18 


* 






-.22 


* 


• 19 


* 






-.24 


* 






.18 


* 


-.22 


* 










.22 


* 


.30 


** 


.21 


* 






.23 


* 


.32 


** 










.39 


** 


.27 


** 










.28 


** 


.43 


*** 






.31 


** 






.23 


** 






.29 


** 






.28 


** 






.33 


** 


.25 


** 


.34 


** 


.35 


** 


.37 


** 


.19 


* 


.44 


*** 


.44 


*** 


.41 


*** 


.18 


* 


.36 


** 


-.27 


** 














26 


** 






.22 


* 






.36 


** 


.41 


*** 


.21 


* 










.18 


* 


.37 


** 


.49 


*** 


.33 


** 


,43 


*** 


.36 


** 














.36 


** 






-.20 


* 







-.21 * 
-.19 * 



-.24 * 



.24 * 
..35 ** 



Coleman-Liau readability 










.47 


*** 


.42 


*** 


Flesch readability 














.35 


** 


Percentage of abstract words 


.21 


* 


.33 


** 


-.22 


* 






TOEFL Section I (LC) 


.72 


*** 


.73 


*** 


.77 


*** 


.79 


*** 


TOEFL Section II (S & WE) 


.84 


*** 


.83 


*** 


.79 


*** 


.86 


*★* 


Holistic score for topic 


.62 


*** 


.65 


*** 


.63 


*** 


.61 


*** 


Sentence score for topic 


.66 


*** 


.61 


*** 


(no 


score) 


(no 


score) 


Discourse score for topic 


.63 


*** 


.67 


*** 


(no 


score) 


(no 


score) 



*Levels of significance indicated by asterisks: * = .05, ** = .01, *** = .001
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Table 28 



Significant Correlations of Holistic Scores on Writing Samples
with Writer's Workbench Variables, TOEFL Section Scores, and
Writing Sample Scores for Four Writing Sample Topics*

Writer's Workbench        Farming    Space    Continents    Leisure
Text Features               C/G       C/C        C/G          C/C

Quality of development
Number of spelling errors
Number to check
Number of punctuation errors
Variation
Average sentence length
Number of short sentences
Number of long sentences
Percentage of simple sentences
Percentage of "to be" verbs
Percentage of passives
Percentage of nominalizations
Number of sentences
Number of words
Average word length
Number of questions
Number of content words
Percentage of content words
Average length of content words
Percentage of prepositions
Percentage of conjunctions
Percentage of adverbs
Percentage of nouns
Percentage of adjectives
Kincaid readability
Auto readability
Coleman-Liau readability
Percentage of abstract words

TOEFL Section I (LC)
TOEFL Section II (S & WE)
TOEFL Section III (RC)

Sentence score for topic
Discourse score for topic



.31 


** 


-•20 


* 






• 26 


** 


-•28 


** 


-•35 


** 


-•26 


** 


-•26 


** 


.31 


** 


• 37 

-•:o 

-•19 
-•21 


** 
** 
* 
* 


• 33 


** 


• 41 


*** 


• 46 


** 


• 43 


*** 






• 31 


** 


• 37 


** 


• 37 

• 23 


** 
* 






• 25 


* 


• 33 




• 30 


** 


-•19 


* 






.32 


** 


• 41 


*** 






• 24 


* 


• 20 


* 






-•18 


* 






.37 


** 


• 45 


*** 


• 28 


** 


• 58 


*** 


• 49 


** 


• 56 


*** 


• 47 


*** 


• 66 


*** 


• 22 


* 


• 28 


** 


• 34 


** 






• 19 


* 














• 46 


** 


• 60 


*** 






• 69 


*** 


-•27 


* 






• 46 


*** 


• 19 


* 


• 26 


* 


• ^/ 


** 


• 30 


** 






• 38 


** 


• 26 


** 






• 19 


* 


-•21 


* 


-•21 


* 






-•22 
• 29 


* 

** 


-•21 


* 






-•19 


* 






-•19 


* 


-•24 


* 











-.23 * 











39 


** 










• 20 


* 


-•23 


* 






• 57 


*** 


• 62 


*** 


• 62 


*** 


• 65 


*** 


• 67 


*** 


• 62 


*** 


• 69 


*** 


• 64 


*** 


• 62 




• 65 


*** 


• 63 


*** 


• 61 


*** 


• 70 


*** 


• 82 


*** 


(no 


score) 


(no 


score) 


• 79 


*** 


• 83 


*** 


(no 


score) 


(no 


score) 



*Levels of significance indicated by asterisks: * = .05, ** = .01, *** = .001
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Table 29 



Significant Correlations of Discourse/Sentence Scores on Writing Samples
with Writer's Workbench Variables, TOEFL Section Scores, and
Writing Sample Scores for Four Writing Sample Topics*



Sentence-level



Discourse-level



Writer's Workbench 
Text Features 



Farming


Space 

C/C 


Continents 
C/G 
(N«82) 


Leisure 

C/C 
<N«81) 




-•20 * 


.28 ** 


-.23 * 






-.26 ** 


-.26 ** 




-.19 * 




-.29 ** 




• 24 * 


.27 ** 


.44 *** 










» 91 * 








— e^/ 




-.22 * 


.23 * 


,40 *** 


.24 * 


,45 *** 


.31 ** 


.37 ** 


.28 ** 


.35 ** 




.21 « 




.22 * 


. Ha 


.2o 


,40 *** 


.25 * 


.32 ** 


.36 ** 


.33 ** 


.32 ** 


.22 * 


.44 ** 


.20 * 


,50 *** 




AO *** 


.34 ** 


.58 *** 




.32 ** 


.34 ** 


.32 ** 




.52 *** 


.30 ** 


,64 *** 


-.27 * 




-.28 ** 




.41 *** 


.37 ** 


.42 *** 


.37 ** 


,38 *** 


.22 * 


.36 ** 


.19 * 


-.29 ** 


-.21 * 


-.22 * 




-.23 * 




-.23 * 






-.26 ** 




-.26 ** 




-.25 ** 




-.25 * 




-.29 ** 




.29 * 




.29 






.62 *** 


,61 *** 


.62 *** 


.66 *** 


,70 *** 


,68 *** 


.65 *** 


.72 *** 


,66 


,61 *** 


,63 *** 


.67 *** 


.70 *** 


,82 *** 


.79 *** 


,83 *** 






.84 *** 


.84 *** 


,84 *** 


,84 *** 







Quality of development
Number of spelling errors
Percentage of vague words
Number to check
Number of punctuation errors
Variation
Average sentence length
Number of short sentences
Number of long sentences
Percentage of simple sentences
Percentage of "to be" verbs
Percentage of passives
Number of sentences
Number of words
Average word length
Number of content words
Percentage of content words
Average length of content words
Percentage of prepositions
Percentage of conjunctions
Percentage of nouns
Kincaid readability
Auto readability
Coleman-Liau readability
Percentage of abstract words

TOEFL Section I (LC)
TOEFL Section II (S & WE)
TOEFL Section III (RC)

Holistic score for topic
Sentence score for topic
Discourse score for topic

*Levels of significance indicated by asterisks: * = .05, ** = .01, *** = .001
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Table 30 



Writer's Workbench Stepwise Regression Analyses for
TOEFL Section II, Structure and Written Expression

                                        Standardized
Independent Variables                   Regression     T
(Writer's Workbench)                    Weight         Statistic     Mean       SD

Space Topic (Compare/Contrast)
  Number of content words                  .44           4.63       108.99     36.45
  Average length of content words          .56            .31         5.89       .39
  Number of suggestions - other           -.28          -3.28          .01       .11
  Number of suggestions to omit           -.27          -2.94         1.55      1.44
  Flesch readability formula              -.42          -3.39         1.12       .28
  Coleman readability formula              .66           2.92         2.63       .31
  Percentage of nouns                     -.21          -2.10         9.15      8.70

  R² = .70, standard error = 6.05, N = 81
  Mean for TOEFL Section II = 52.32, SD = 8.07

Recreation Topic (Compare/Contrast)
  Number of content words                  .50           5.27       108.48     40.87
  Number of spelling errors               -.39          -4.21         4.88      5.01
  Percentage of nominalizations            .21           2.28         1.07      1.07
  Number of suggestions to substitute     -.21          -2.35          .84      1.36
  Number of questions                     -.19          -2.05          .07       .41

  R² = .63, standard error = 6.42, N = 81
  Mean for TOEFL Section II = 53.27, SD = 7.95

Farming Topic (Chart/Graph)
  Percentage of "to be" verbs              .30           3.35        72.10     20.12
  Number of spelling errors               -.26          -3.04         4.51      4.18
  Number of short sentences                .24           2.91         2.46      1.57
  Length of content words                  .28           2.93         5.74       .48
  Flesch readability formula              -.26          -2.88         1.02       .29
  Percentage of prepositions               .17           2.04         1.29       .31

  R² = .71, standard error = 6.00, N = 87
  Mean for TOEFL Section II = 53.43, SD = 8.15

Continents Topic (Chart/Graph)
  Coleman readability formula              .40           5.05          .93       .19
  Number of words                          .46           5.63       189.71     67.85
  Number of spelling errors               -.36          -4.38         4.30      3.61
  Percentage of abstract words            -.2?          -2.58          .12       .11

  R² = .72, standard error = 5.76, N = 82
  Mean for TOEFL Section II = 52.83, SD = 8.02

Note: ? marks a digit that is illegible in the source.
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Table 31 



Writer's Workbench Stepwise Regression Analyses for Holistic Scores

                                        Standardized
Independent Variables                   Regression     T
(Writer's Workbench)                    Weight         Statistic     Mean       SD

Space Topic (Compare/Contrast)
  Number of content words                  .63           7.63       106.99     36.45
  Number of spelling errors               -.25          -3.22         5.33      3.72
  Average length of content words          .26           3.30         5.89       .39
  Number of suggestions to omit           -.18          -2.19         1.55      1.44

  R² = .74, standard error = 1.01, N = 81
  Mean of Space holistic scores = 3.23, SD = 1.47

Recreation Topic (Compare/Contrast)
  Number of content words                  .73          10.64       108.48     40.87
  Number of spelling errors               -.40          -5.77         4.88      5.01
  Percentage of conjunctions              -.16          -2.35          .47       .16

  R² = .81, standard error = .91, N = 81
  Mean of Recreation holistic scores = 3.39, SD = 1.50

Farming Topic (Chart/Graph)
  Number of words                          .45           5.41       193.79     72.46
  Number of spelling errors               -.28          -3.56         4.51      4.18
  Percentage of prepositions               .28           3.51         1.29       .31
  Number of long sentences                 .26           3.17         1.02       .84
  Average length of content words          .17           2.11         5.74       .48
  Percentage of sentence beginnings        .16           2.06        62.61     19.47

  R² = .75, standard error = .97, N = 87
  Mean of Farming holistic scores = 3.36, SD = 1.41

Continents Topic (Chart/Graph)
  Number of words                          .56           7.16       189.71     67.85
  Coleman readability formula              .38           4.90          .93       .19
  Percentage of abstract words            -.26          -3.32          .12       .11
  Number of spelling errors               -.27          -3.46         4.30      3.61
  Percentage of adjectives                -.17          -2.15         1.64       .44

  R² = .76, standard error = .98, N = 82
  Mean of Continents holistic scores = 3.30, SD = 1.43
Table 32

Writer's Workbench Stepwise Regression Analyses for D/S Scores

                                        Standardized
Independent Variables                   Regression     T
(Writer's Workbench)                    Weight         Statistic     Mean       SD

Space Topic (Compare/Contrast)

For Discourse-level scores:
  Number of content words                  .66           8.23       108.99     36.45
  Average length of content words          .40           4.93         5.89       .39
  Number of suggestions to substitute     -.21          -2.57         2.01      1.65
  Flesch readability formula              -.20          -2.41         1.12       .28

  R² = .76, standard error = .89, N = 81
  Mean of Space Discourse scores = 3.24, SD = 1.34

For Sentence-level scores:
  Number of content words                  .46           5.43       108.99     36.45
  Average length of content words          .41           4.50         5.89       .39
  Flesch readability formula              -.29          -3.12         1.12       .28
  Punctuation                             -.20          -2.33

  R² = .68, standard error = .94, N = 87
  Mean of Space Sentence scores = 2.84, SD = 1.25

Farming Topic (Chart/Graph)

For Discourse-level scores:
  Average length of content words          .61           5.52         5.74       .48
  Number of words                          .25           2.95       193.79     72.46
  Coleman readability formula             -.33          -2.87
  Percentage of prepositions               .24           2.77         1.29       .31
  Percentage of nouns                     -.33          -3.46         2.80       .28
  Percentage of pronouns                  -.23          -2.31          .56       .26

  R² = .71, standard error = .85, N = 82
  Mean of Farming Discourse scores = 3.37, SD = 1.16

For Sentence-level scores:
  Percentage of "to be" verbs              .28           3.61        72.10     20.12
  Percentage of prepositions               .27           3.50         1.29       .31
  Number of long sentences                 .34           4.62         1.02       .84
  Number of spelling errors               -.23          -2.90         4.51      4.18
  Average length of content words          .20           3.39         5.74       .48
  Percentage of nominalizations            .72           2.80         2.80      1.33

  R² = .75, standard error = 8.30
  Mean of Farming Sentence scores = 3.04, SD = 1.21
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Appendixes 

A. Writing Assessment Test Instructions and Topics

B. List of Readers for Reading Weekend
   List of Subject Matter Readers



Appendix A 

Writing Assessment Test Instructions and Topics 



A-1 



TOEFL Writing Sample

Total Time - 2 hours

4 Topics
30 Minutes Per Topic

Please PRINT the following information:

Name:
        Family Name          First Name          M.I.

TOEFL APPLICATION NUMBER:

Native Country:

What major subject do you plan to study?
How many years have you studied English?

Please CHECK the appropriate boxes:

Applying for admission as:
    Undergraduate student
    Graduate student

Sex:
    Male
    Female



General Instructions

You will have two hours to plan and write essays on the four topics in
this booklet. At the end of each thirty-minute period, the supervisor
will tell you to stop writing on one topic and begin writing on the next
topic. These topics are presented to give you an opportunity to show
how well you can write. There are many possible responses to each topic
but no "right" answers. What is important, therefore, is that you take
care to express your thoughts on each topic clearly and effectively. How
well you write is more important than how much you write. However, to
cover each topic adequately, you should write more than one paragraph.

Write your essays in this booklet, using the lined pages that follow each
topic. You will have enough space if you write on every line, avoid wide
margins, and keep your handwriting to a reasonable size. You may use the
space immediately below each topic to make notes, if you wish.

PLEASE DO NOT OPEN THIS BOOKLET UNTIL YOU ARE TOLD TO DO SO.
TIME - 30 MINUTES



Some people say that exploration of outer space has many
advantages; other people feel that it is a waste of money
and other resources. Write a brief essay in which you
discuss each of these positions. Give one or two advantages
and disadvantages of space exploration, and explain
which position you support.



THIS SPACE MAY BE USED FOR NOTES. 
TIME - 30 MINUTES



Many people enjoy active physical recreation like sports and
other forms of exercise. Other people prefer intellectual
activities like reading or listening to music. In a brief
essay, discuss one or two benefits of physical activities and
of intellectual activities. Explain which kind of recreation
you think is more valuable to someone your age.



THIS SPACE MAY BE USED FOR NOTES. 




A-4 



TIME - 30 MINUTES



CHANGES IN FARMING IN THE U.S.: 1940 - 1980

[Three graphs, each with a time axis running from 1940 to 1980; graph titles and data are not legible in the source]

Suppose that you are writing a report in which you must interpret the
three graphs shown above. Write the section of that report in which you
discuss how the graphs are related to each other and explain the
conclusions you have reached from the information in the graphs. Be sure
the graphs support your conclusions.



THIS SPACE MAY BE USED FOR NOTES.
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A-5 

TIME - 30 MINUTES
AREA AND POPULATION OF CONTINENTS

[Two charts; ? marks a digit that is illegible in the source]

Area
  Asia             30%
  Africa           20%
  North America    1?%
  South America    12%
  Antarctica        ?%
  Europe            7%
  Oceania           ?%

Population
  Asia             58%
  Europe           ??%
  Africa           ??%
  North America     9%
  South America     5%
  Oceania           ?%
  Antarctica        0%

Suppose you are to write a report in which you interpret these
charts. Discuss how the information in the Area chart is related
to the information in the Population chart. Explain the
conclusions you have reached from the information in the two charts.
Be sure the charts support your conclusions.



YOU MAY USE THIS SPACE FOR NOTES.
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Appendix B 



List of Readers for Reading Weekend 
List of Subject Matter Readers
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B-1 



Essay Reading Participants 
January 28-29, 1984 



Name                    Affiliation                        State  Subject

CHIEF READERS

McNamara, Susan         William Paterson College           NJ     English
Walling, William        Rutgers University                 NJ     English

TABLE LEADERS

Arena, Louis            University of Delaware             DE     ESL
Berezovsky, Helen       University of Pennsylvania         PA     ESL
Dyer, Patricia          University of Delaware             DE     ESL
Earisman, Delbert       Upsala College                     NJ     English
Lorenzi, Robert         Camden County College              NJ     English
Morgan, RoseAnn         Middlesex County College           NJ     English
Olson, Jerry            Middlesex County College           NJ     English
Taylor, Barry           University of Pennsylvania         PA     ESL

ESL READERS

Baron, Melvyn           Kingsborough Comm. College         NY     ESL
Carew, Pat              Nyack High School                  NY     ESL
Carty, Kathleen         Columbia University                NY     ESL
David, Elizabeth        Princeton Adult Education          NJ     ESL
Della Torre, Thomas     Bergen Community College           NJ     ESL
Emery, Cornelia         Rater for Test of Spoken English   PA     ESL
Haffar, Shirley         SUNY-New Paltz                     NY     ESL
Halliday, Cynthia       SUNY-New Paltz                     NY     ESL
Hay, Alice              Pennington School                  NJ     ESL
Lebowitz, Regina        N.Y.C. Technical College           NY     ESL
Lunt, Ruth              Rutgers University                 NJ     ESL
McDowell, Alfred        Bergen Community College           NJ     ESL
Reilly, Joseph          Brooklyn Polytech                  NY     ESL
Ruiz, Aida              Hostos Community College           NY     ESL
Sayre, Johanna          SUNY-New Paltz                     NY     ESL
Shanefield, Elizabeth   Princeton Adult School             NJ     ESL
Slighton, Margaret      Private ESL Tutor                  NJ     ESL
Stansfield, Charles     TOEFL staff                        NJ     ETS
Stewart-Ghali, Denise   SUNY-New Paltz                     NY     ESL
Suomi, Barbara          TOEFL staff                        NJ     ETS
Tolo, Marc              Pennington School                  NJ     ESL
Van Duren, David        Bergen Community College           NJ     ESL
Villaneuva, Alfredo     Hostos Community College           NY     ESL






B-2 



Name                    Affiliation                        State  Subject

ENGLISH READERS

Asher, Deborah          Union County College               NJ     English
Billiar, Donald         Union County College               NJ     English
Buscemi, Santi          Middlesex County College           NJ     English
Cirasa, Robert          Kean College of New Jersey         NJ     English
Collins, John           Glassboro State College            NJ     English
Collins, Marilyn        Glassboro State College            NJ     English
Conlon, Michael         William Paterson College           NJ     English
Daniels, Barbara        Camden County College              NJ     English
Edge, Donald            Camden County College              NJ     English
Granger, Virgie         Wm. Paterson College               NJ     English
Gruenberg, Diane        Edison State College               NJ     English
King, Barbara           Rutgers University                 NJ     English
Lees, Irene             Felician College                   NJ     English
Lutz, William           Rutgers University                 NJ     English
Mehlman, Robert         Trenton State College              NJ     English
O'Day, Daniel           Kean College of New Jersey         NJ     English
Oszmanski, Pat          Department of Higher Education     NJ
Otten, Ted              Mercer County Community College    NJ     English
Palladino, Mary         Glassboro State                    NJ     English
Palmere, Martha         West Windsor-Plainsboro H.S.       NJ     English
Piltch, Ziva            Pace University                    NJ     English
Rehbein, Edith          Middlesex County College           NJ     English
Shea, Michael           Mercer County Comm. College        NJ     English




B-3 



Subject Matter Readers 



Dr. Robert Stover
Department of Political Science
University of Colorado

Dr. Melvin Oliver
Department of Sociology
UCLA

Dr. Terry Lenz
Chemical Engineering Department
Colorado State University

Dr. Robert Hunsperger
Electrical Engineering Department
University of Delaware

Dr. John Trowbridge
Civil Engineering Department
University of Delaware

Dr. Gene Chesson
Civil Engineering Department
University of Delaware

Dr. Douglas Kleiber, Ph.D.
Leisure Behavior Research Laboratory
Champaign, IL

Dr. Richard Fisher
Department of Education
Colorado State University






